UMBC Data Science
Permanent URI for this collection: http://hdl.handle.net/11603/22728
The Data Science graduate program at UMBC prepares students to respond to the growing demand for professionals with data science knowledge, skills, and abilities. Our program brings together faculty from a wide range of fields who have a deep understanding of the real-world applications of data analytics. UMBC’s Data Science programs prepare students to excel in data science roles through hands-on experience, rigorous academics, and access to a robust network of knowledgeable industry professionals.
Recent Submissions
Item The effect of different feature selection methods on models created with XGBoost (2024-11-08) Neyra, Jorge; Siramshetty, Vishal B.; Ashqar, Huthaifa
This study examines the effect that different feature selection methods have on models created with XGBoost, a popular machine learning algorithm with superb regularization methods. It shows that three different ways of reducing the dimensionality of features produce no statistically significant change in the prediction accuracy of the model. This suggests that the traditional idea of removing noisy training data to make sure models do not overfit may not apply to XGBoost. However, feature selection may still be worthwhile as a way to reduce computational complexity.
Item When to Commute During the COVID-19 Pandemic and Beyond: Analysis of Traffic Crashes in Washington, D.C (2024-11-08) Choi, Joanne; Clark, Sam; Jaiswal, Ranjan; Kirk, Peter; Jayaraman, Sachin; Ashqar, Huthaifa
Many workers in cities across the world, who have been teleworking because of the COVID-19 pandemic, are expected to return to their commutes. As this process is believed to be gradual and telecommuting is likely to remain an option for many workers, hybrid models and flexible schedules might become the norm in the future. These variable work schedules allow employees to commute outside of traditional rush hours. Moreover, many studies have shown that commuters might be skeptical of using trains, buses, and carpools and could turn to personal vehicles to get to work, which might increase congestion and crashes on the roads. This study attempts to provide information on the safest time to commute to the Washington, DC area by analyzing historical traffic crash data from before the COVID-19 pandemic. It also aims to advance our understanding of traffic crashes and related factors, such as weather, in the Washington, DC area.
We created a model to predict crashes by time of day, using a negative binomial regression after rejecting a Poisson regression, and additionally explored the validity of a Random Forest regression. Our main consideration for an eventual application of this study is to reduce crashes in Washington, DC, using this tool to give people better options on when to commute and when to telework, if available. The study also provides policymakers and researchers with real-world insights that could help decrease the number of traffic crashes and achieve the goals of the Vision Zero Initiative adopted by the District.
Item Is Function Similarity Over-Engineered? Building a Benchmark (2024-10-30) Saul, Rebecca; Liu, Chang; Fleischmann, Noah; Zak, Richard; Micinski, Kristopher; Raff, Edward; Holt, James
Binary analysis is a core component of many critical security tasks, including reverse engineering, malware analysis, and vulnerability detection. Manual analysis is often time-consuming, but identifying commonly-used or previously-seen functions can reduce the time it takes to understand a new file. However, given the complexity of assembly, and the NP-hard nature of determining function equivalence, this task is extremely difficult. Common approaches often use sophisticated disassembly and decompilation tools, graph analysis, and other expensive pre-processing steps to perform function similarity searches over some corpus. In this work, we identify a number of discrepancies between the current research environment and the underlying application need. To remedy this, we build a new benchmark, REFuSE-Bench, for binary function similarity detection consisting of high-quality datasets and tests that better reflect real-world use cases. In doing so, we address issues like data duplication and accurate labeling, experiment with real malware, and perform the first serious evaluation of ML binary function similarity models on Windows data.
Our benchmark reveals that a new, simple baseline, one which looks at only the raw bytes of a function, and requires no disassembly or other pre-processing, is able to achieve state-of-the-art performance in multiple settings. Our findings challenge conventional assumptions that complex models with highly-engineered features are being used to their full potential, and demonstrate that simpler approaches can provide significant value.
Item SERN: Simulation-Enhanced Realistic Navigation for Multi-Agent Robotic Systems in Contested Environments (2024-10-22) Hossain, Jumman; Dey, Emon; Chugh, Snehalraj; Ahmed, Masud; Anwar, Mohammad Saeid; Faridee, Abu Zaher Md; Hoppes, Jason; Trout, Theron; Basak, Anjon; Chowdhury, Rafidh; Mistry, Rishabh; Kim, Hyun; Freeman, Jade; Suri, Niranjan; Raglin, Adrienne; Busart, Carl; Gregory, Timothy; Ravi, Anuradha; Roy, Nirmalya
The increasing deployment of autonomous systems in complex environments necessitates efficient communication and task completion among multiple agents. This paper presents SERN (Simulation-Enhanced Realistic Navigation), a novel framework integrating virtual and physical environments for real-time collaborative decision-making in multi-robot systems. SERN addresses key challenges in asset deployment and coordination through a bi-directional communication framework using the AuroraXR ROS Bridge. Our approach advances the SOTA through accurate real-world representation in virtual environments using the Unity high-fidelity simulator; synchronization of physical and virtual robot movements; efficient ROS data distribution between remote locations; and integration of SOTA semantic segmentation for enhanced environmental perception. Our evaluations show a 15% to 24% improvement in latency and up to a 15% increase in processing efficiency compared to traditional ROS setups.
Real-world and virtual simulation experiments with multiple robots demonstrate synchronization accuracy, achieving less than 5 cm positional error and under 2-degree rotational error. These results highlight SERN's potential to enhance situational awareness and multi-agent coordination in diverse, contested environments.
Item Neural Normalized Compression Distance and the Disconnect Between Compression and Classification (2024-10-20) Hurwitz, John; Nicholas, Charles; Raff, Edward
It is generally well understood that predictive classification and compression are intrinsically related concepts in information theory. Indeed, many deep learning methods are explained as learning a kind of compression, with the claim that better compression leads to better performance. We interrogate this hypothesis via the Normalized Compression Distance (NCD), which explicitly relies on compression as the means of measuring similarity between sequences and thus enables nearest-neighbor classification. By turning popular large language models (LLMs) into lossless compressors, we develop a Neural NCD and compare LLMs to classic general-purpose algorithms like gzip. In doing so, we find that classification accuracy is not predictable by compression rate alone, among other empirical aberrations not predicted by current understanding. Our results imply that our intuition on what it means for a neural network to "compress" and what is needed for effective classification are not yet well understood.
Item Identifying Economic Factors Affecting Unemployment Rates in the United States (2024-11-04) Green, Alrick; Nasim, Ayesha; Radadia, Jaydeep; Kallam, Devi Manaswi; Kalyanam, Viswas; Owenga, Samfred; Ashqar, Huthaifa
In this study, we seek to understand how macroeconomic factors such as GDP, inflation, Unemployment Insurance, and the S&P 500 index, as well as microeconomic factors such as health, race, and educational attainment, impacted the unemployment rate over about 20 years in the United States.
Our research question is to identify which factor(s) contributed the most to the unemployment rate surge using linear regression. Results from our study showed that GDP (negative), inflation (positive), Unemployment Insurance (contrary to popular opinion; negative), and the S&P 500 index (negative) were all significant factors, with inflation being the most important one. As for health issue factors, our model produced resultant correlation scores for occurrences of Cardiovascular Disease, Neurological Disease, and Interpersonal Violence with unemployment. Race as a factor showed large discrepancies in the unemployment rate between Black Americans and their counterparts. Asians had the lowest unemployment rate throughout the years. As for educational attainment, results showed that higher educational attainment significantly reduced one's chance of unemployment. People with higher degrees had the lowest unemployment rate. Results of this study will be beneficial for policymakers and researchers in understanding the unemployment rate during the pandemic.
Item The Impact of Medicaid Expansion on Medicare Quality Measures (2024-11-05) Algrain, Hala; Cardosa, Elizabeth; Desai, Shekha; Fong, Eugene; Ringoir, Tanguy; Ashqar, Huthaifa
The Affordable Care Act was signed into law in 2010, expanding Medicaid and improving access to care for millions of low-income Americans. Fewer uninsured individuals reduced the cost of uncompensated care, consequently improving the financial health of hospitals. We hypothesize that this amelioration in hospital finances resulted in a marked improvement of quality measures in states that chose to expand Medicaid. To our knowledge, the impact of Medicaid expansion on the Medicare population has not been investigated.
Using a difference-in-difference analysis, we compare readmission rates for four measures from the Hospital Readmission Reduction Program: acute myocardial infarction, pneumonia, heart failure, and coronary artery bypass graft surgery. Our analysis provides evidence that between 2013 and 2021 expansion states improved hospital quality relative to non-expansion states as it relates to acute myocardial infarction readmissions (p = 0.015) and coronary artery bypass graft surgery readmissions (p = 0.039). Our analysis provides some evidence that expanding Medicaid improved hospital quality, as measured by a reduction in readmission rates. Using visualizations, we provide some evidence that hospital quality improved for the other two measures as well. We believe that a refinement of our estimation method and an improved dataset will increase our chances of finding significant results for these two other measures.
Item The Effect of Funding on Student Achievement: Evidence from District of Columbia, Virginia, and Maryland (2024-11-05) Raabe, Adam; Reynolds, Jessica; Kukudala, Akshitha; Ashqar, Huthaifa
The question of how to best serve the student populations of our country is a complex topic. Since public funding is limited, we must explore the best ways to direct the money to improve student outcomes. Previous research has suggested that socio-economic status is the best predictor of student achievement, while other studies suggest that the amount of money spent on the student is a more significant factor. In this paper, we explore this question and its impacts on Maryland, Virginia, and the District of Columbia schools. We conclude that the graduation rate has a direct relationship with unemployment, suggesting that funding towards improving out-of-school opportunities and quality of life will significantly improve students' chances of success.
We do not find a significant relationship between per-pupil spending and student achievement.
Item ALDAS: Audio-Linguistic Data Augmentation for Spoofed Audio Detection (2024-10-21) Khanjani, Zahra; Mallinson, Christine; Foulds, James; Janeja, Vandana
Spoofed audio, i.e. audio that is manipulated or AI-generated deepfake audio, is difficult to detect when only using acoustic features. Some recent innovative work involving AI-spoofed audio detection models augmented with phonetic and phonological features of spoken English, manually annotated by experts, led to improved model performance. While this augmented model produced substantial improvements over traditional models based on acoustic features, a scalability challenge motivates inquiry into auto labeling of features. In this paper we propose an AI framework, Audio-Linguistic Data Augmentation for Spoofed audio detection (ALDAS), for auto labeling linguistic features. ALDAS is trained on linguistic features selected and extracted by sociolinguistics experts; these auto labeled features are used to evaluate the quality of ALDAS predictions. Findings indicate that while the detection enhancement is not as substantial as when involving the pure ground truth linguistic features, there is improvement in performance while achieving auto labeling. Labels generated by ALDAS are also validated by the sociolinguistics experts.
Item Predictive Maintenance of Urban Metro Vehicles: Classification of Air Production Unit Failures Using Machine Learning (2023-03) Najjar, Ayat; Ashqar, Huthaifa; Hasasneh, Ahmad
Predictive maintenance methods assist early detection of failures and errors in machinery before they reach critical stages.
Predictive maintenance (PdM) is crucial for companies to avoid unplanned outages, increase overall reliability, and lower operating costs. Failure detection and classification is a key element of predictive maintenance. In this study, a novel framework for identifying failures in the Air Production Unit (APU) of metro vehicles in real time was proposed. The framework can also be used to create a recommendation system for predicting APU failures. To the best of our knowledge, this is the first study that detects and classifies failures in the APU of metro vehicles using a real-time approach that includes machine learning. Analog sensors were found to be more significant than digital sensors in providing real-time, continuous data that is crucial for maintaining safe and efficient train operation. The proposed framework yielded promising results, with the highest F-Score of about 85% for the binary classifier and 97% for the multiclass classifier using the RF algorithm on the MetroPT dataset. The framework can be beneficial for metro operators by reducing maintenance costs, increasing safety, improving reliability, better managing assets, and enhancing the passenger experience. By predicting when maintenance is needed, operators can address potential safety issues before they become serious problems, improve the reliability of the metro system, and reduce disruptions for passengers. The most important analog sensor-based features include the pressure within the trains' installed air tanks, oil temperature on the compressor, and flowmeter values.
The proposed framework is applicable in the field and can help operators make more informed decisions about when to repair or replace assets.
Item Road sign classification using deep learning (National Academy of Sciences, 2023-09) Ashour, Karim; Nafaa, Selvia; Emad, Doaa; Mohamed, Rana; Essam, Hafsa; Elhenawy, Mohammed; Ashqar, Huthaifa; Hassan, Abdallah A.; Glaser, Sebastien; Rakotonirainy, Andry
Road sign classification is essential for safety, especially with the development of autonomous vehicles and automated road asset management. Road sign classification is challenging because of several factors, including lighting, weather conditions, motion blur and car vibration. In this study, we developed an ensemble of fine-tuned pre-trained CNN networks. We used the German Traffic Sign Recognition Benchmark (GTSRB) to train and test the proposed ensemble. The proposed ensemble yielded a preliminary testing accuracy of 96.8%. Consequently, we customized the architecture of the worst-performing network in the ensemble, which boosted the accuracy to 99%.
Item Intersection detection using vehicle trajectories data: Deep Neural Network application (National Academy of Sciences, 2023-09) Kased, Abanoub; Rabee, Rana; Fahmy, Akram; Mohamed, Hussien; Yacoub, Marco; Elhenawy, Mohammed; Ashqar, Huthaifa; Hassan, Abdallah A.; Glaser, Sebastien; Rakotonirainy, Andry
In 2021, intersection-adjacent crashes were stated to cause 7.7% of total annual road deaths in Australia (BITRE, n.d.). Generating intersection maps is essential for future Cooperative Intelligent Transport Systems (C-ITS) deployment. Nonetheless, crowdsourced vehicle trajectories are a viable and affordable data source that can be used to generate maps. However, intersection maps are changeable, and building one map inference model for all intersection types is challenging.
Therefore, we need an object detector that can detect and classify the different intersections using the 2-D scatter plot of the crowdsourced trajectories. Consequently, each subset of trajectory data points is passed to the suitable intersection map inference model. This study used two real-world vehicle trajectory datasets, T-Drive and ECML-PKDD 15, to classify the intersections by building an object detection model using a Deep Neural Network (DNN). We created 2000 images to train a Single-Shot detector, and the initial testing results were promising.
Item Deep Learning-Based pavement defect detection (National Academy of Sciences, 2023-09) Mohamed, R.; Esam, H.; Nafaa, S.; Ashour, K.; Emad, D.; Elhenawy, M.; Ashqar, Huthaifa; Hassan, A. A.; Glaser, S.; Rakotonirainy, A.
Pavement defects can significantly impact road safety, and detecting and repairing these defects is important. However, pavement defect detection by humans is time-consuming. With the advances in information and communication technology, many vehicles on the road are fitted with cameras, generating massive, crowdsourced data. This study demonstrates the usage of deep learning and computer vision to identify and classify pavement defects. We used the Road Damage Dataset 2022 (Arya et al., 2022) to train and test different object detectors, ensuring accurate and reliable detection.
The initial results showed that it is possible to identify and classify pavement defects efficiently, with results of 80% mAP50, reducing the risk of accidents. In addition, using these methods can lead to cost savings in maintenance and repair expenses, as well as reduce the environmental impact of routine road surveys.
Item Traffic Estimation of Various Connected Vehicle Penetration Rates: Temporal Convolutional Network Approach (IEEE, 2024-05) Ashqer, Mujahid; Ashqar, Huthaifa; Elhenawy, Mohammed; Rakha, Hesham A.; Bikdash, Marwan
Traffic estimation using probe vehicle data is a crucial aspect of traffic management as it provides real-time information about traffic conditions. This study introduced a novel framework for traffic density estimation using a Temporal Convolutional Network (TCN) for time series data. The study used two datasets collected from a three-leg intersection in Greece and a four-leg intersection in Germany. The model was built to predict the density in an approach of the signalized intersection using features extracted from the other approaches. The results showed that the highest accuracy was achieved when only probe vehicle data was used. This implies that relying solely on probe vehicle data from two approaches can effectively predict traffic density in the third approach, even when the Market Penetration Rate (MPR) is low. The results also indicated that having Signal Phase and Timing (SPaT) information may not be necessary for high accuracy in traffic estimation and that as the MPR increases, the model becomes more predictable.
Item Investigation of reusability of effluents from an organized industrial zone wastewater treatment plant using a pressure-driven membrane process (IWA, 2023-10-10) Ocal, Zehra Betül; Karagunduz, Ahmet; Keskinler, Bulent; Dizge, Nadir; Ashqar, Huthaifa
The quantity of wastewater being discharged into the environment due to the rise in industrial activities is progressively growing over time.
Aside from the large environmental risk posed by untreated wastewater discharge, the reuse of treated water prevents wastage of large amounts of water. For this reason, in this study, the reuse potential of an organized industrial zone wastewater was investigated by membrane processes. The appropriate membrane type and rejection performance were determined for various pollutant parameters including conductivity, chemical oxygen demand (COD), total nitrogen (TN), chloride, and sulfate. Laboratory-scale batch membrane filtration experiments were performed by using three different membrane types (BW30, XLE, and X20). The experiments were conducted at 15 and 20 bar pressures and flux data were collected during the operations. The results showed that BW30 and X20 membranes could be operated comfortably with 80% recovery for the wastewater containing low and high sulfate concentrations. For the wastewater with low sulfate concentration, the fluxes of BW30 and X20 at 20 bar were 19.7 and 16.4 L/m²/h, respectively, at 80% recovery. On the other hand, for the wastewater with higher sulfate concentration, the fluxes of BW30 and X20 at 20 bar were 8.6 and 11.5 L/m²/h, respectively.
Item Predictive Analytics in Mental Health Leveraging LLM Embeddings and Machine Learning Models for Social Media Analysis (IGI Global, 2024-01-01) Radwan, Ahmad; Amarneh, Mohannad; Alawneh, Hussam; Ashqar, Huthaifa; AlSobeh, Anas; Magableh, Aws Abed Al Raheem
The prevalence of stress-related disorders has increased significantly in recent years, necessitating scalable methods to identify affected individuals. This paper proposes a novel approach utilizing large language models (LLMs), with a focus on OpenAI's generative pre-trained transformer (GPT-3) embeddings and machine learning (ML) algorithms to classify social media posts as indicative or not of stress disorders. The aim is to create a preliminary screening tool leveraging online textual data.
GPT-3 embeddings transformed posts into vector representations capturing semantic meaning and linguistic nuances. Various models, including support vector machines, random forests, XGBoost, KNN, and neural networks, were trained on a dataset of >10,000 labeled social media posts. The top model, a support vector machine, achieved 83% accuracy in classifying posts displaying signs of stress.
Item A Generic and Extendable Framework for Benchmarking and Assessing the Change Detection Models (2024-03-20) Hassouna, Ahmed Alaa Abdelbaky; Ismail, Mohamed Badr; Alqahtani, Ali; Alqahtani, Nayef; Hassan, Amany Shaban; Ashqar, Huthaifa; AlSobeh, Anas M. R.; Hassan, Abdallah A.; Elhenawy, Mohammed
Change Detection (CD) of aerial images refers to identifying and analyzing changes between two or more aerial images of the same location taken at different times. The CD is a highly challenging task due to the need to distinguish relevant changes, such as urban expansion, deforestation, or post-disaster damage assessment, from irrelevant ones, such as light conditions, shadows, and seasonal variations. Many CD papers have recently been published, where most of the papers that proposed a new model contained a comparison between their proposed and state-of-the-art (SOTA) models. While many recent studies propose new deep learning (DL) models for improving CD performance, their comparative analyses are often restricted, lacking comprehensive insights into the proposed models' real-world generalizability, robustness, and performance trade-offs across diverse change characteristics. This paper presents a novel generic framework to systematically benchmark and assess DL-based CD models through three parallel pipelines: 1) cross-testing models on diverse benchmark datasets to evaluate generalization, 2) robustness analysis against different image corruptions, and 3) multi-faceted contour-level analytics evaluating detection sensitivity to change size/complexity.
The framework is applied to comparatively evaluate five state-of-the-art DL-based CD models: Changeformer, BIT, Tiny, SNUNet, and CSA-CDGAN. Extensive experiments unveil each model's strengths, limitations and biases, highlighting their relative proficiencies in generalizing across data distributions, resilience to noise corruption, and discriminative capabilities for changes of varying characteristics. The proposed benchmarking framework demonstrates significant potential for guiding the selection of suitable CD models tailored to specific application requirements by comprehensively evaluating their generalizability, robustness, and detection capabilities across diverse real-world scenarios. This systematic evaluation approach can drive future research into developing more robust and versatile CD solutions aligned with practical needs.
Item Factors influencing bikeshare service and usage in a rural college town: A case study of Montgomery County, VA (Taylor & Francis, 2024-01-03) Woodson, Cat; Ashqar, Huthaifa; Almannaa, Mohammed; Elhenawy, Mohammed; Buehler, Ralph
While much of the bikeshare boom has centered around larger cities, smaller, lower-density, and even some rural communities have also implemented bikeshare systems successfully. Using a bikeshare dataset of more than 14,000 trips that cover the period from July 2018 to December 2021 for both pedal and e-bikes, this paper describes the structure and performance of ROAM NRV, a bikeshare system in Montgomery County, Virginia, which is home to Virginia Tech university and has many areas classified as rural. The paper presents bikeshare users’ travel behaviors and usage trends (including during the COVID-19 pandemic). Moreover, it compares the usage of the system’s pedal bicycles to electric bicycles (e-bikes) that were introduced in 2021. Findings indicated that residents of Blacksburg and Christiansburg regularly use and benefit from bikeshare much like their urban counterparts do.
Ridership was noted to likely be more common among university affiliates, with trips more likely to start or end on or around campus due to the number of stations located within campus grounds. Trail usage was also high among bikeshare users due to the extensive trail network within and between the towns. As rural bikeshare users tend to travel greater distances and encounter more varying terrains throughout their commutes, considering e-bikes instead of pedal bike systems should increase the utilization of such mobility systems in rural areas. When electric assist bicycles were first introduced to the system, initially replacing some and then all former pedal bicycles, utilization increased significantly compared to pedal bike usage.
Item Automated Pavement Cracks Detection and Classification Using Deep Learning (IEEE, 2024-07-11) Nafaa, Selvia; Ashour, Karim; Mohamed, Rana; Essam, Hafsa; Emad, Doaa; Elhenawy, Mohammed; Ashqar, Huthaifa; Hassan, Abdallah A.; Alhadidi, Taqwa I.
Monitoring asset conditions is a crucial factor in building efficient transportation asset management. Because of substantial advances in image processing, traditional manual classification has been largely replaced by semi-automatic/automatic techniques. As a result, automated asset detection and classification techniques are required. This paper proposes a methodology to detect and classify roadway pavement cracks using the well-known You Only Look Once (YOLO) version 5 (YOLOv5) and version 8 (YOLOv8) algorithms. Experimental results indicated that the precision of pavement crack detection reaches up to 67.3% under different illumination conditions and image sizes. The findings of this study can assist highway agencies in accurately detecting and classifying asset conditions under different illumination conditions.
This will reduce the cost and time associated with manual inspection, which can greatly reduce the cost of highway asset maintenance.
Item Advancing Roadway Sign Detection with YOLO Models and Transfer Learning (IEEE, 2024-04) Nafaa, Selvia; Ashour, Karim; Mohamed, Rana; Essam, Hafsa; Emad, Doaa; Elhenawy, Mohammed; Ashqar, Huthaifa; Hassan, Abdallah A.; Alhadidi, Taqwa I.
Roadway sign detection and recognition is an essential element of Advanced Driving Assistant Systems (ADAS). Several artificial intelligence methods have been used widely, among them YOLOv5 and YOLOv8. In this paper, we used a modified YOLOv5 and YOLOv8 to detect and classify different roadway signs under different illumination conditions. Experimental results indicated that for the YOLOv8 model, varying the number of epochs and batch size yields consistent mAP50 scores, ranging from 94.6% to 97.1% on the testing set. The YOLOv5 model demonstrates competitive performance, with mAP50 scores ranging from 92.4% to 96.9%. These results suggest that both models perform well across different training setups, with YOLOv8 generally achieving slightly higher mAP50 scores, offering valuable insights for practitioners seeking reliable and adaptable solutions in object detection applications.
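Several of the object detection abstracts above report mAP50 scores. Under the usual object detection convention, mAP50 counts a predicted box as a match when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below illustrates only that matching criterion, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max). The names `iou` and `is_match_at_50` are hypothetical and are not taken from any of the listed papers, whose evaluation code is not shown on this page.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give zero intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """True when the prediction overlaps the ground truth enough to count at IoU 0.5."""
    return iou(pred, truth) >= threshold


if __name__ == "__main__":
    # Illustrative boxes only, not data from any study
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(is_match_at_50((0, 0, 2, 2), (0, 0, 2, 1.5)))
```

A full mAP50 computation would additionally rank predictions by confidence and average precision over classes; that part is omitted here.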