UMBC Data Science
Permanent URI for this collection: http://hdl.handle.net/11603/22728
The Data Science graduate program at UMBC prepares students to respond to the growing demand for professionals with data science knowledge, skills, and abilities. Our program brings together faculty from a wide range of fields who have a deep understanding of the real-world applications of data analytics. UMBC’s Data Science programs prepare students to excel in data science roles through hands-on experience, rigorous academics, and access to a robust network of knowledgeable industry professionals.
Recent Submissions
Item Regional Air Mobility Flight Demand Modeling in Tennessee State (2024-12-11) Acharya, Kamal; Lad, Mehul; Song, Houbing; Sun, Liang
Advanced Air Mobility (AAM), encompassing Urban Air Mobility (UAM) and Regional Air Mobility (RAM), offers innovative solutions to mitigate ground transportation problems such as traffic congestion and environmental pollution. RAM addresses transportation inefficiencies over medium-distance trips (50-500 miles), which are often underserved by both traditional air and ground transportation systems. This study focuses on RAM in Tennessee, addressing the complexities of demand modeling as a critical aspect of effective RAM implementation. Leveraging datasets from the Bureau of Transportation Statistics (BTS), Internal Revenue Service (IRS), Federal Aviation Administration (FAA), and other sources, we assess trip data across Tennessee's Metropolitan Statistical Areas (MSAs) to develop a predictive framework for RAM demand. Through cost, time, and risk regression, we calculate a Generalized Travel Cost (GTC) that allows for comparative analysis between ground transportation and RAM, identifying factors that influence mode choice. When focusing on only five major airports (BNA, CHA, MEM, TRI, and TYS) as RAM hubs, the results reveal a mixed demand pattern due to varying travel distances to these central locations, which increases back-and-forth travel for some routes. However, by expanding the RAM network to include more regional airports, the GTC for RAM aligns more closely with traditional air travel, providing a smoother and more competitive option against ground transportation, particularly for trips exceeding 300 miles. The analysis shows that RAM is likely to be selected when air transportation accounts for more than 80% of the total GTC, air travel time is more than 1 hour, and the ground GTC exceeds 300 for specific origin-destination pairs. The data and code can be accessed on GitHub.
https://github.com/lotussavy/AIAAScitecth-2025.git
Item An Investigation of the Relationship Between Crime Rate and Police Compensation (2024-11-21) Amarsingh, Jhancy; Appakondreddigari, Likhith Kumar Reddy; Nunna, Ashish; Tummala, Charishma Choudary; Winship, John; Zhou, Alex; Ashqar, Huthaifa
The goal of this paper is to assess whether there is any correlation between police salaries and crime rates. Using public data sources that contain Baltimore crime rates and Baltimore Police Department (BPD) salary information from 2011 to 2021, our research uses a variety of techniques to capture and measure any correlation between the two. Based on that correlation, the paper then uses established social theories to make recommendations on how this data can potentially be used by state leadership. Our initial results show a negative correlation between salary/compensation levels and crime rates.
Item Optimizing Daily Fantasy Baseball Lineups: A Linear Programming Approach for Enhanced Accuracy (2024-11-17) Grody, Max; Bansal, Sandeep; Ashqar, Huthaifa
Daily fantasy baseball has shortened the life cycle of an entire fantasy season into a single day. Today, it has become familiar to more than 10 million people around the world who participate in online fantasy. As daily fantasy continues to grow, selecting a winning lineup becomes more important.
The purpose of this paper is to determine how accurate FanDuel's current strategy for optimizing daily lineups is, and to use Python and linear programming to build a lineup optimizer for daily fantasy sports, with the goal of proposing a more accurate model that helps daily fantasy participants select a winning lineup.
Item Flood Risk Assessment of the National Harbor at Maryland, United States (2024-11-17) Negussie, Neftalem; Yesserie, Addis; Harris, Chinchu; Keita, Abou; Ashqar, Huthaifa
Over the past few decades, floods have become one of the costliest natural hazards, and losses have sharply escalated. Floods are an increasing problem in urban areas due to increased residential settlement along the coastline, and climate change is a contributing factor to their increased frequency. To analyze flood risk, a model is proposed to identify the factors associated with increased flooding at a local scale. The study area includes National Harbor, MD, and the surrounding area of Fort Washington. The objective is to assess flood risk due to sea level rise in the study area. The study demonstrated that coastal flood risk increased with sea level rise even though the predicted level of impact is fairly insignificant for the study area. The level of impact from increased flooding is highly dependent on the location of the properties and other topographic information.
Item Impact of Covid-19 on Taxi Industry and Travel Behavior: A Case Study on Chicago, IL (2024-11-12) Chinthala, Naga Sireesha; Lewis, Jenell; Vuppalapati, Sravan; Sivaraman, Khiran Kumar Chidambaram; Toley, Chinmay Vivek; Ashqar, Huthaifa
As the debate over the future of transportation continues amid the deepening global crisis of the COVID-19 pandemic, the taxi industry seems not to be spared the quick and disruptive changes that may arise from the pandemic.
The impact is relatively higher in major cities because of their high population density and transportation congestion. In this study, we used spatial analysis and visualization to investigate the impact of the pandemic on the economics of the taxi industry and travel behavior using trip-by-trip data from 2014 to 2020 in Chicago, IL. Results show a drastic decline in trips in the central city and airport areas. During the pandemic, people tended to travel longer distances, but travel times were considerably shorter because of the significant reduction in traffic volumes. Results also showed that the top twenty most popular pick-up and drop-off locations only included downtown Chicago and O'Hare International Airport before the pandemic. However, during the pandemic, the top twenty most popular pick-up and drop-off locations were distributed among the airport, downtown, and many other areas along Chicago's East Side.
Item ALDAS: Audio-Linguistic Data Augmentation for Spoofed Audio Detection (2024-10-21) Khanjani, Zahra; Mallinson, Christine; Foulds, James; Janeja, Vandana
Spoofed audio, i.e. audio that is manipulated or AI-generated deepfake audio, is difficult to detect when only using acoustic features. Some recent innovative work involving AI-spoofed audio detection models augmented with phonetic and phonological features of spoken English, manually annotated by experts, led to improved model performance. While this augmented model produced substantial improvements over traditional acoustic-feature-based models, a scalability challenge motivates inquiry into auto labeling of features. In this paper we propose an AI framework, Audio-Linguistic Data Augmentation for Spoofed audio detection (ALDAS), for auto labeling linguistic features. ALDAS is trained on linguistic features selected and extracted by sociolinguistics experts; these auto labeled features are used to evaluate the quality of ALDAS predictions.
Findings indicate that while the detection enhancement is not as substantial as with the pure ground-truth linguistic features, performance still improves while achieving auto labeling. Labels generated by ALDAS are also validated by the sociolinguistics experts.
Item The effect of different feature selection methods on models created with XGBoost (2024-11-08) Neyra, Jorge; Siramshetty, Vishal B.; Ashqar, Huthaifa
This study examines the effect that different feature selection methods have on models created with XGBoost, a popular machine learning algorithm with superb regularization methods. It shows that three different ways of reducing the dimensionality of features produce no statistically significant change in the prediction accuracy of the model. This suggests that the traditional idea of removing noisy training data to make sure models do not overfit may not apply to XGBoost. However, feature selection may still be worthwhile for reducing computational complexity.
Item When to Commute During the COVID-19 Pandemic and Beyond: Analysis of Traffic Crashes in Washington, D.C (2024-11-08) Choi, Joanne; Clark, Sam; Jaiswal, Ranjan; Kirk, Peter; Jayaraman, Sachin; Ashqar, Huthaifa
Many workers in cities across the world, who have been teleworking because of the COVID-19 pandemic, are expected to be back to their commutes. As this process is believed to be gradual and telecommuting is likely to remain an option for many workers, hybrid models and flexible schedules might become the norm in the future. These variable work schedules allow employees to commute outside of traditional rush hours. Moreover, many studies showed that commuters might be skeptical of using trains, buses, and carpools and could turn to personal vehicles to get to work, which might increase congestion and crashes on the roads. This study attempts to provide information on the safest time to commute in the Washington, DC area by analyzing historical traffic crash data from before the COVID-19 pandemic.
It also aims to advance our understanding of traffic crashes and related factors such as weather in the Washington, DC area. We created a model to predict crashes by time of day, using a negative binomial regression after rejecting a Poisson regression, and additionally explored the validity of a Random Forest regression. Our main consideration for an eventual application of this study is to reduce crashes in Washington, DC, with a tool that gives people better options on when to commute and when to telework, if available. The study also provides policymakers and researchers with real-world insights for decreasing the number of traffic crashes, helping achieve the goals of the Vision Zero Initiative adopted by the District.
Item Is Function Similarity Over-Engineered? Building a Benchmark (2024-10-30) Saul, Rebecca; Liu, Chang; Fleischmann, Noah; Zak, Richard; Micinski, Kristopher; Raff, Edward; Holt, James
Binary analysis is a core component of many critical security tasks, including reverse engineering, malware analysis, and vulnerability detection. Manual analysis is often time-consuming, but identifying commonly-used or previously-seen functions can reduce the time it takes to understand a new file. However, given the complexity of assembly, and the NP-hard nature of determining function equivalence, this task is extremely difficult. Common approaches often use sophisticated disassembly and decompilation tools, graph analysis, and other expensive pre-processing steps to perform function similarity searches over some corpus. In this work, we identify a number of discrepancies between the current research environment and the underlying application need. To remedy this, we build a new benchmark, REFuSE-Bench, for binary function similarity detection consisting of high-quality datasets and tests that better reflect real-world use cases.
In doing so, we address issues like data duplication and accurate labeling, experiment with real malware, and perform the first serious evaluation of ML binary function similarity models on Windows data. Our benchmark reveals that a new, simple baseline, one which looks at only the raw bytes of a function and requires no disassembly or other pre-processing, is able to achieve state-of-the-art performance in multiple settings. Our findings challenge conventional assumptions that complex models with highly-engineered features are being used to their full potential, and demonstrate that simpler approaches can provide significant value.
Item SERN: Simulation-Enhanced Realistic Navigation for Multi-Agent Robotic Systems in Contested Environments (2024-10-22) Hossain, Jumman; Dey, Emon; Chugh, Snehalraj; Ahmed, Masud; Anwar, Mohammad Saeid; Faridee, Abu Zaher Md; Hoppes, Jason; Trout, Theron; Basak, Anjon; Chowdhury, Rafidh; Mistry, Rishabh; Kim, Hyun; Freeman, Jade; Suri, Niranjan; Raglin, Adrienne; Busart, Carl; Gregory, Timothy; Ravi, Anuradha; Roy, Nirmalya
The increasing deployment of autonomous systems in complex environments necessitates efficient communication and task completion among multiple agents. This paper presents SERN (Simulation-Enhanced Realistic Navigation), a novel framework integrating virtual and physical environments for real-time collaborative decision-making in multi-robot systems. SERN addresses key challenges in asset deployment and coordination through a bi-directional communication framework using the AuroraXR ROS Bridge. Our approach advances the SOTA through accurate real-world representation in virtual environments using the Unity high-fidelity simulator; synchronization of physical and virtual robot movements; efficient ROS data distribution between remote locations; and integration of SOTA semantic segmentation for enhanced environmental perception.
Our evaluations show a 15% to 24% improvement in latency and up to a 15% increase in processing efficiency compared to traditional ROS setups. Real-world and virtual simulation experiments with multiple robots demonstrate synchronization accuracy, achieving less than 5 cm positional error and under 2-degree rotational error. These results highlight SERN's potential to enhance situational awareness and multi-agent coordination in diverse, contested environments.
Item Neural Normalized Compression Distance and the Disconnect Between Compression and Classification (2024-10-20) Hurwitz, John; Nicholas, Charles; Raff, Edward
It is generally well understood that predictive classification and compression are intrinsically related concepts in information theory. Indeed, many deep learning methods are explained as learning a kind of compression, with better compression leading to better performance. We interrogate this hypothesis via the Normalized Compression Distance (NCD), which explicitly relies on compression as the means of measuring similarity between sequences and thus enables nearest-neighbor classification. By turning popular large language models (LLMs) into lossless compressors, we develop a Neural NCD and compare LLMs to classic general-purpose algorithms like gzip. In doing so, we find that classification accuracy is not predictable by compression rate alone, among other empirical aberrations not predicted by current understanding.
Our results imply that what it means for a neural network to "compress" and what is needed for effective classification are not yet well understood.
Item Identifying Economic Factors Affecting Unemployment Rates in the United States (2024-11-04) Green, Alrick; Nasim, Ayesha; Radadia, Jaydeep; Kallam, Devi Manaswi; Kalyanam, Viswas; Owenga, Samfred; Ashqar, Huthaifa
In this study, we seek to understand how macroeconomic factors such as GDP, inflation, Unemployment Insurance, and the S&P 500 index, as well as microeconomic factors such as health, race, and educational attainment, impacted the unemployment rate for about 20 years in the United States. Our research question is to identify which factor(s) contributed the most to the unemployment rate surge using linear regression. Results from our studies showed that GDP (negative), inflation (positive), Unemployment Insurance (contrary to popular opinion; negative), and the S&P 500 index (negative) were all significant factors, with inflation being the most important one. As for health factors, our model produced correlation scores between unemployment and occurrences of Cardiovascular Disease, Neurological Disease, and Interpersonal Violence. Race as a factor showed large discrepancies in the unemployment rate between Black Americans and their counterparts. Asians had the lowest unemployment rate throughout the years. As for educational attainment, results showed that higher educational attainment significantly reduced one's chance of unemployment. People with higher degrees had the lowest unemployment rate.
Results of this study will be beneficial for policymakers and researchers in understanding the unemployment rate during the pandemic.
Item The Impact of Medicaid Expansion on Medicare Quality Measures (2024-11-05) Algrain, Hala; Cardosa, Elizabeth; Desai, Shekha; Fong, Eugene; Ringoir, Tanguy; Ashqar, Huthaifa
The Affordable Care Act was signed into law in 2010, expanding Medicaid and improving access to care for millions of low-income Americans. Fewer uninsured individuals reduced the cost of uncompensated care, consequently improving the financial health of hospitals. We hypothesize that this amelioration in hospital finances resulted in a marked improvement of quality measures in states that chose to expand Medicaid. To our knowledge, the impact of Medicaid expansion on the Medicare population has not been investigated. Using a difference-in-difference analysis, we compare readmission rates for four measures from the Hospital Readmission Reduction Program: acute myocardial infarction, pneumonia, heart failure, and coronary artery bypass graft surgery. Our analysis provides evidence that between 2013 and 2021 expansion states improved hospital quality relative to non-expansion states as it relates to acute myocardial infarction readmissions (p = 0.015) and coronary artery bypass graft surgery readmissions (p = 0.039). Our analysis provides some evidence that expanding Medicaid improved hospital quality, as measured by a reduction in readmission rates. Using visualizations, we provide some evidence that hospital quality improved for the other two measures as well.
We believe that a refinement of our estimation method and an improved dataset will increase our chances of finding significant results for these two other measures.
Item The Effect of Funding on Student Achievement: Evidence from District of Columbia, Virginia, and Maryland (2024-11-05) Raabe, Adam; Reynolds, Jessica; Kukudala, Akshitha; Ashqar, Huthaifa
The question of how to best serve the student populations of our country is a complex topic. Since public funding is limited, we must explore the best ways to direct the money to improve student outcomes. Previous research has suggested that socio-economic status is the best predictor of student achievement, while other studies suggest that the amount of money spent on the student is a more significant factor. In this paper, we explore this question and its impacts on Maryland, Virginia, and the District of Columbia schools. We conclude that the graduation rate has a direct relationship with unemployment, suggesting that funding towards improving out-of-school opportunities and quality of life will significantly improve students' chances of success. We do not find a significant relationship between per-pupil spending and student achievement.
Item Predictive Maintenance of Urban Metro Vehicles: Classification of Air Production Unit Failures Using Machine Learning (2023-03) Najjar, Ayat; Ashqar, Huthaifa; Hasasneh, Ahmad
Predictive maintenance methods assist early detection of failures and errors in machinery before they reach critical stages. Predictive maintenance (PdM) is crucial for companies to avoid unplanned outages, increase overall reliability, and lower operating costs. Failure detection and classification is a key element of predictive maintenance. In this study, a novel framework for identifying failures in the Air Production Unit (APU) of metro vehicles in real-time was proposed. The framework can also be used to create a recommendation system for predicting APU failures. To the best of our knowledge, this is the first study to detect and classify failures in the APU of metro vehicles using a real-time approach that includes machine learning. Analog sensors were found to be more significant than digital sensors in providing real-time, continuous data that is crucial for maintaining safe and efficient train operation.
The proposed framework produced promising results, with the highest F-Score of about 85% for the binary classifier and 97% for the multiclass classifier using the RF algorithm on the MetroPT dataset. The framework can be beneficial for metro operators by reducing maintenance costs, increasing safety, improving reliability, better managing assets, and enhancing the passenger experience. By predicting when maintenance is needed, operators can address potential safety issues before they become serious problems, improve the reliability of the metro system, and reduce disruptions for passengers. The most important analog sensor-based features include the pressure within the trains' installed air tanks, oil temperature on the compressor, and flowmeter values. The proposed framework is applicable in the field and can help operators make more informed decisions about when to repair or replace assets.
Item Road sign classification using deep learning (National Academy of Sciences, 2023-09) Ashour, Karim; Nafaa, Selvia; Emad, Doaa; Mohamed, Rana; Essam, Hafsa; Elhenawy, Mohammed; Ashqar, Huthaifa; Hassan, Abdallah A.; Glaser, Sebastien; Rakotonirainy, Andry
Road sign classification is essential for safety, especially with the development of autonomous vehicles and automated road asset management. Road sign classification is challenging because of several factors, including lighting, weather conditions, motion blur and car vibration. In this study, we developed an ensemble of fine-tuned pre-trained CNN networks. We used the German Traffic Sign Recognition Benchmark (GTSRB) to train and test the proposed ensemble. The proposed ensemble yielded a preliminary testing accuracy of 96.8%.
Consequently, we customized the architecture of the worst-performing network in the ensemble, which boosted the accuracy to 99%.
Item Intersection detection using vehicle trajectories data: Deep Neural Network application (National Academy of Sciences, 2023-09) Kased, Abanoub; Rabee, Rana; Fahmy, Akram; Mohamed, Hussien; Yacoub, Marco; Elhenawy, Mohammed; Ashqar, Huthaifa; Hassan, Abdallah A.; Glaser, Sebastien; Rakotonirainy, Andry
In 2021, intersection-adjacent crashes were stated to cause 7.7% of total annual road deaths in Australia (BITRE, n.d.). Generating intersection maps is essential for future Cooperative Intelligent Transport Systems (C-ITS) deployment. Crowdsourced vehicle trajectories are a viable and affordable data source that can be used to generate maps. However, intersection maps are changeable, and building one map inference model for all intersection types is challenging. Therefore, we need an object detector that can detect and classify the different intersections using the 2-D scatter plot of the crowdsourced trajectories. Each subset of trajectory data points is then passed to the suitable intersection map inference model. This study used two real-world vehicle trajectory datasets, T-Drive and ECML-PKDD 15, to classify the intersections by building an object detection model using a Deep Neural Network (DNN). We created 2000 images to train a Single-Shot detector; the initial testing results were promising.
Item Deep Learning-Based pavement defect detection (National Academy of Sciences, 2023-09) Mohamed, R.; Esam, H.; Nafaa, S.; Ashour, K.; Emad, D.; Elhenawy, M.; Ashqar, Huthaifa; Hassan, A. A.; Glaser, S.; Rakotonirainy, A.
Pavement defects can significantly impact road safety, and detecting and repairing these defects is important. However, pavement defect detection by humans is time-consuming.
With the advances in information and communication technology, many vehicles on the road are fitted with cameras, generating massive, crowdsourced data. This study demonstrates the usage of deep learning and computer vision to identify and classify pavement defects. We used the Road Damage Dataset 2022 (Arya et al., 2022) to train and test different object detectors, ensuring accurate and reliable detection. The initial results showed that it is possible to identify and classify pavement defects efficiently, with results of 80% mAP50, reducing the risk of accidents. In addition, using these methods can lead to cost savings in maintenance and repair expenses, as well as reduce the environmental impact of routine road surveys.
Item Traffic Estimation of Various Connected Vehicle Penetration Rates: Temporal Convolutional Network Approach (IEEE, 2024-05) Ashqer, Mujahid; Ashqar, Huthaifa; Elhenawy, Mohammed; Rakha, Hesham A.; Bikdash, Marwan
Traffic estimation using probe vehicle data is a crucial aspect of traffic management as it provides real-time information about traffic conditions. This study introduced a novel framework for traffic density estimation using a Temporal Convolutional Network (TCN) for time series data. The study used two datasets collected from a three-leg intersection in Greece and a four-leg intersection in Germany. The model was built to predict the density in one approach of the signalized intersection using features extracted from the other approaches. The results showed that the highest accuracy was achieved when only probe vehicle data was used. This implies that relying solely on probe vehicle data from two approaches can effectively predict traffic density in the third approach, even when the Market Penetration Rate (MPR) is low. The results also indicated that having Signal Phase and Timing (SPaT) information may not be necessary for high accuracy in traffic estimation and that as the MPR increases, the model becomes more predictable.