Browsing by Subject "Computer Vision"
Now showing 1 - 14 of 14
Item BACKDOOR ATTACKS IN COMPUTER VISION: TOWARDS ADVERSARIALLY ROBUST MACHINE LEARNING MODELS (2022-01-01) Saha, Aniruddha; Pirsiavash, Hamed; Computer Science and Electrical Engineering; Computer Science
Deep Neural Networks (DNNs) have become the standard building block in numerous machine learning applications, including computer vision, speech recognition, machine translation, and robotic manipulation, achieving state-of-the-art performance on complex tasks. The widespread success of these networks has driven their deployment in sensitive domains like health care, finance, autonomous driving, and defense-related applications. However, DNNs are vulnerable to adversarial attacks. An adversary is a person with malicious intent whose goal is to disrupt the normal functioning of a machine learning pipeline. Research has shown that an adversary can tamper with the training process of a model by injecting misrepresentative data (poisons) into the training set. The manipulation is done in such a way that the victim's model malfunctions only when a trigger modifies a test input. These are called backdoor attacks. For instance, a backdoored model in a self-driving car might work accurately for days before it suddenly fails to detect a pedestrian when the adversary decides to exploit the backdoor. Vulnerability to backdoor attacks is dangerous when deep learning models are deployed in safety-critical applications. This dissertation studies ways in which state-of-the-art deep learning methods for computer vision are vulnerable to backdoor attacks and proposes defense methods to remedy the vulnerabilities. We push the limits of our current understanding of backdoors and address the following research questions. Can we design practical backdoor attacks? We propose the Hidden Trigger Backdoor Attack, a novel clean-label backdoor attack in which the poisoned images do not contain a visible trigger. This enables the attacker to keep the trigger hidden until its use at test time.
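To make the trigger mechanic concrete, here is a toy Python sketch of pasting a trigger patch onto a test image at test time. The function name and patch layout are illustrative, not from the dissertation's code, and note that the hidden-trigger attack's poisoned training images deliberately do not contain this visible patch.

```python
# Toy illustration of the test-time trigger mechanic: a small fixed
# patch pasted onto an input image activates a backdoored model.
# All names here are hypothetical, not from the dissertation's code.

def paste_trigger(image, trigger, top, left):
    """Return a copy of `image` (a 2D list of pixel values) with the
    smaller 2D list `trigger` pasted at position (top, left)."""
    patched = [row[:] for row in image]  # copy rows so the input is untouched
    for i, trigger_row in enumerate(trigger):
        for j, value in enumerate(trigger_row):
            patched[top + i][left + j] = value
    return patched

# A 4x4 "image" of zeros and a 2x2 trigger of ones.
image = [[0] * 4 for _ in range(4)]
trigger = [[1, 1], [1, 1]]
patched = paste_trigger(image, trigger, 2, 2)
print(patched[3][3])  # the bottom-right region now carries the trigger
```

A clean-label hidden-trigger attack would instead optimize poison images so that no such patch is visible during training, revealing the patch only at test time.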
Is it secure to train models on large-scale public data? Self-supervised learning (SSL) methods for vision have utilized large-scale unlabeled public data to learn rich visual representations. We show that if a small part of the unlabeled training data is poisoned, SSL methods are vulnerable to backdoor attacks. Backdoor attacks are more practical in self-supervised learning, since the use of large unlabeled data makes data inspection to remove poisons prohibitive. Can we design efficient and generalizable backdoor detection methods? We propose a backdoor detection method that optimizes for a set of images which, when forwarded through any model, reliably indicates whether the model contains a backdoor. Our "litmus" test for backdoored models improves on state-of-the-art methods without requiring access to clean data during detection. It is computationally efficient and generalizes to new triggers as well as new architectures.
Item Boosting Self-supervised Learning via Knowledge Transfer (2018-01-01) Kavalkazhani Vinjimoor, Ananthachari; Pirsiavash, Hamed; Computer Science and Electrical Engineering; Computer Science
In self-supervised learning (SSL), an auxiliary task is designed to solve a particular problem (also called pretraining) on a specific dataset without the need for human annotation. This process is the initial phase of transfer learning, where one learns a model on an auxiliary task and transfers it to another task by fine-tuning on a target dataset. In transfer learning, the inherent constraint is to use the same model architecture for both pretraining and fine-tuning. This approach gives rise to issues in designing and comparing various models and auxiliary tasks. For example, one cannot use different model architectures for the auxiliary task and the target-domain task due to the limitations of fine-tuning and training settings.
Since model architectures with varying task complexities are being used by researchers, it is hard to compare different approaches. The motive of this work is to design a framework that overcomes the above-mentioned limitations. If there is a way to transfer knowledge from a pretrained model to a target model, then we should be able to use different architectures in the two phases. Towards this goal, we designed a novel framework that separates auxiliary training from target-domain training by developing an effective transfer method based on clustering. We cluster the features computed from the pretrained model to obtain pseudo-labels and learn a novel representation to predict the pseudo-labels. The intuition behind this approach is that, in a good visual representation space, semantically similar data points must be closer together than dissimilar data points; this metric should be learnt inherently by the network during pretraining in order to generate good features. This approach gives us flexibility in assessing otherwise incompatible models, such as hand-crafted features. The separation enables us to use different model architectures during auxiliary training and target-domain training, and also to experiment with deeper models to learn better representations. We are also able to boost performance by increasing the complexity of the auxiliary task and then transferring the knowledge from a deeper model to a shallower one. We conducted experiments on various datasets to evaluate the performance of this method. This framework outperformed all current state-of-the-art SSL methods on benchmark datasets.
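The clustering-based transfer described above can be sketched as follows. This is a minimal illustration using scalar features and plain k-means, with hypothetical function names rather than the paper's actual implementation: cluster the teacher's features, treat the cluster indices as pseudo-labels, and use those labels to supervise a student of any architecture.

```python
# Minimal sketch of pseudo-label generation by k-means clustering of
# teacher features (scalar features for brevity). The resulting labels
# would then supervise the student network; names are illustrative.
import random

def kmeans_pseudo_labels(features, k, iters=20, seed=0):
    """Cluster scalar features into k groups; return a cluster index
    (pseudo-label) per feature."""
    rng = random.Random(seed)
    centers = rng.sample(features, k)
    labels = [0] * len(features)
    for _ in range(iters):
        # Assign each feature to its nearest center.
        labels = [min(range(k), key=lambda c: abs(f - centers[c]))
                  for f in features]
        # Recompute each center as the mean of its assigned features.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

# Teacher features for 6 images: two well-separated groups.
feats = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]
labels = kmeans_pseudo_labels(feats, k=2)
print(labels)  # first three images share one pseudo-label, last three the other
```

In the actual framework the features are deep embeddings and the student is trained with a classification loss on these pseudo-labels, which is what decouples the teacher and student architectures.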
Our method achieved 72.5% mAP on the classification task and 57.2% mAP on the object detection task of the PASCAL VOC dataset.
Item Emerging Frontiers in Human-Robot Interaction (Springer, 2024-03-18) Safavi, Farshad; Olikkal, Parthan; Pei, Dingyi; Kamal, Sadia; Meyerson, Helen; Penumalee, Varsha; Vinjamuri, Ramana
Effective interactions between humans and robots are vital to achieving shared tasks in collaborative processes. Robots can utilize diverse communication channels to interact with humans, such as hearing, speech, sight, touch, and learning. Our focus, amidst the various means of interaction between humans and robots, is on three emerging frontiers that significantly impact the future directions of human-robot interaction (HRI): (i) human-robot collaboration inspired by human-human collaboration, (ii) brain-computer interfaces, and (iii) emotionally intelligent perception. First, we explore advanced techniques for human-robot collaboration, covering a range of methods from compliance- and performance-based approaches to synergistic and learning-based strategies, including learning from demonstration, active learning, and learning from complex tasks. Then, we examine innovative uses of brain-computer interfaces for enhancing HRI, with a focus on applications in rehabilitation, communication, and brain-state and emotion recognition. Finally, we investigate emotional intelligence in robotics, focusing on translating human emotions to robots via facial expressions, body gestures, and eye tracking for fluid, natural interactions. Recent developments in these emerging frontiers and their impact on HRI are detailed and discussed. We highlight contemporary trends and emerging advancements in the field.
Ultimately, this paper underscores the necessity of a multimodal approach in developing systems capable of adaptive behavior and effective interaction between humans and robots, thus offering a thorough understanding of the diverse modalities essential for maximizing the potential of HRI.
Item Evaluation of the Radon Transform for Line Detection Applications (2020-01-01) Shylla, Achennaki; Chapman, David; Computer Science and Electrical Engineering; Computer Science
We evaluate a modified Approximate Discrete Radon Transform (ADRT) for line detection applications. A traditional method for straight-line detection is the Hough Transform and its variants, such as the Probabilistic Hough Transform. However, to achieve acceptable performance, the Hough Transform typically applies a binary threshold which decimates the strength of the gradient magnitude, an informative quantity for precise determination of edge-line intensity. In many practical images, it is difficult or impossible to obtain an acceptable threshold for edge detection such that all prominent lines are detected without introducing major artifacts. The Radon Transform overcomes these limitations by performing line detection directly on the original Sobel gradient image, thereby preserving gradient magnitude intensity. The Radon Transform has rarely been applied to the detection of straight lines in images because it is often erroneously assumed that the forward Radon Transform is too inefficient to calculate for these purposes. However, the ADRT is highly efficient to calculate over images due to dynamic programming, which yields O(N^2 lg N) computation for an NxN image (of N^2 pixels). Parallelizing this method can reduce the computation to O(lg N) steps on O(N^2) processors. We apply and evaluate the ADRT algorithm for detection of straight lines in images over modern datasets, and additionally introduce a novel filtering scheme for detecting local maxima which correspond to line angle (theta) detections.
Furthermore, we show that applying blurring and non-maximal suppression to the resulting images strengthens peak intensity, thereby improving the ability to detect faint lines. The performance of our method is evaluated against traditional line detection algorithms such as the Hough Transform, for which we consider the Radon Transform to be a direct improvement, as well as more recent methods such as the Line Segment Detector (LSD). Experimental results suggest that the ADRT achieves better faint-line detection and processing time than the Hough Transform and is comparable in accuracy to state-of-the-art techniques. We conclude that the ADRT is a mathematically sound improvement over the traditional Hough Transform for straight-line detection in images and eliminates the need to decimate the gradient with binary thresholds.
Item A Flash Flood Categorization System using Scene-Text Recognition (IEEE, 2018) Basnyat, Bipendra; Roy, Nirmalya; Gangopadhyay, Aryya
Detecting flash floods in real time and taking rapid action are of utmost importance to prevent loss of human lives, infrastructure, and personal property in a smart city. In this paper, we develop a low-cost, low-power cyber-physical system prototype using a Raspberry Pi camera to detect rising water levels. We deployed the system in the real world and collected data in different environmental conditions (early morning in the presence of fog, sunny afternoon, late afternoon at sunset). We employ image processing and text recognition techniques to detect the rising water level and articulate several challenges in deploying such a system in a real environment.
We envision this prototype design will pave the way for mass deployment of flash flood detection systems with minimal human intervention.
Item Human-machine Intelligence: A Design Paradigm (2019-01-01) Rahman, Mahbubur; Banerjee, Nilanjan; Computer Science and Electrical Engineering; Computer Science
In this age of artificial intelligence, we are witnessing the power of human-machine collaboration in transforming the way we live, work, and solve problems. Humans and machines can complement each other in resolving intractable and sophisticated issues that are hard or impossible for computers alone. Such collaboration has achieved great results in digitizing books, detecting star clusters, and transcribing audio and video, among other problems. Researchers have investigated these problems in isolation, however, and there is no clear guideline about why and when human intelligence can be useful and, if so, what design pattern to follow. Integrating humans adds human knowledge, which can help to solve complex, open-ended, and uncertain problems. However, it also brings the human limitations of less automation, less precision, and biased opinion. Analyzing the trade-offs of integrating humans is necessary before designing a collaborative system. In this dissertation, we address the issues described above and propose a collaborative system design paradigm. Analyzing the general architecture of such a system, we found that human intelligence can help at three different functional positions: data preprocessing, feature extraction, and decision making. In all these functional areas, humans can help to improve the performance of a system. We also provide the conditions under which a system can phase out human involvement in the long run. We developed four different systems that represent the conditions mentioned above and provide detailed guidelines.
We provide detailed steps for integrating humans in decision making, feature extraction, and preprocessing through our weed identification, resource localization, and group conversation analysis systems, respectively. We also explain the conditions and steps needed to reduce human contribution in the long run through the object detection system. In each system, we show: a) why human intelligence is necessary over computer intelligence alone; b) the steps for integrating human knowledge to overcome the difficulties; and c) the trade-offs of integrating humans instead of machines.
Item Locally Aware Transformer for Person Re-Identification (2021-01-01) Kapil, Siddhant R; Chapman, David; Computer Science and Electrical Engineering; Computer Science
Person re-identification is an important problem in computer-vision-based surveillance applications, in which the same person must be identified from surveillance photographs across a variety of nearby zones. At present, the majority of person re-ID techniques are based on Convolutional Neural Networks (CNNs), but Vision Transformers are beginning to displace pure CNNs for a variety of object recognition tasks. The primary output of a vision transformer is a global classification token, but vision transformers also yield local tokens which contain additional information about local regions of the image. Techniques to make use of these local tokens to improve classification accuracy are an active area of research. We propose a novel Locally Aware Transformer (LA-Transformer) that employs a Parts-based Convolution Baseline (PCB)-inspired strategy for aggregating globally enhanced local classification tokens into an ensemble of N classifiers, where N is the number of patches. LA-Transformer achieves rank-1 accuracy of 98.27% with a standard deviation of 0.13 on the Market-1501 dataset and 98.7% with a standard deviation of 0.2 on the CUHK03 dataset, outperforming all other state-of-the-art methods.
Item NOD-CC: A Hybrid CBR-CNN Architecture for Novel Object Discovery (Springer, Cham, 2019-08-09) Turner, JT; Floyd, Michael W.; Gupta, Kalyan; Oates, Tim
Deep learning methods have shown a rapid increase in popularity due to their state-of-the-art performance on many machine learning tasks. However, these methods often rely on extremely large datasets to accurately train the underlying machine learning models. For supervised learning techniques, the human effort required to acquire, encode, and label a sufficiently large dataset may add such a high cost that deploying the algorithms is infeasible. Even if a sufficient workforce exists to create such a dataset, the human annotators may differ in the quality, consistency, and level of granularity of their labels. Any impact this has on the overall dataset quality will ultimately impact the potential performance of an algorithm trained on it. This paper partially addresses this issue by providing an approach, called NOD-CC, for discovering novel object types in images using a combination of Convolutional Neural Networks (CNNs) and Case-Based Reasoning (CBR). The CNN component labels instances of known object types while deferring to the CBR component to identify and label novel, or poorly understood, object types. Thus, our approach leverages the state-of-the-art performance of CNNs in situations where sufficient high-quality training data exists, while minimizing its limitations in data-poor situations.
We empirically evaluate our approach on a popular computer vision dataset and show significant improvements in object classification performance when full knowledge of potential class labels is not known in advance.
Item Person Re-Identification using Vision Transformer with Auxiliary Tokens (2021-01-01) Sharma, Charu; Chapman, David; Computer Science and Electrical Engineering; Computer Science
Person re-identification (re-ID) is an object re-ID problem that aims to re-identify a person by finding an association between the images of a person captured by multiple cameras. Due to its foundational role in computer-vision-based video surveillance applications, it is vital to generate a robust feature embedding to represent a person. CNN-based methods are known for their feature learning abilities, and for many years were a prime choice for person re-ID. In this thesis, we explore a method that takes advantage of the auxiliary local tokens and the global token of the vision transformer to generate the final feature embedding. We also propose a novel blockwise fine-tuning technique that improves the performance of the Vision Transformer. Our model trained with blockwise fine-tuning achieves 96.6% rank-1 accuracy and a 90.3% mAP score on the Market-1501 dataset. On the CUHK-03 dataset, it achieves 97.5% rank-1 accuracy and a 95.03% mAP score. These results are comparable to many recently published methods for this problem.
Item SELF-SUPERVISED LEARNING BY COMPRESSING REPRESENTATIONS FOR LIGHTWEIGHT MODELS (2022-01-01) Abbasi Koohpayegani, Soroush; Pirsiavash, Hamed; Computer Science and Electrical Engineering; Computer Science
Self-supervised learning aims to learn good representations with unlabeled data. Recent works have shown that larger models benefit more from self-supervised learning than smaller models. As a result, the gap between supervised and self-supervised learning has been greatly reduced for larger models.
In this work, instead of designing a new pseudo task for self-supervised learning, we develop a model compression method to compress an already learned, deep self-supervised model (the teacher) into a smaller one (the student). We train the student model so that it mimics the relative similarity between the data points in the teacher's embedding space. For AlexNet, our method outperforms all previous methods, including the fully supervised model, on ImageNet linear evaluation (59.0% compared to 56.5%) and on nearest neighbor evaluation (50.7% compared to 41.4%). To the best of our knowledge, this is the first time a self-supervised AlexNet has outperformed a supervised one on ImageNet classification. Moreover, we show that our method is effective in a few other applications: reducing the computation precision rather than the model depth only, learning small models for video representations, learning across modalities, and self-distillation.
Item Transfer Learning of Grounded Language Models For Use In Robotic Systems (2020-01-01) Jenkins, Patrick; Matuszek, Cynthia; Computer Science and Electrical Engineering; Computer Science
Grounded language acquisition is the modeling of language as it relates to physical objects in the world. Grounded language models are useful for creating an interface between robots and humans using natural language, but are ineffective when a robot enters a novel environment due to lack of training data. I create a novel grounded language dataset by capturing multi-angle, high-resolution color and depth images of household objects, then collecting natural language text and speech descriptions of the objects. This dataset is used to train a model that learns associations between the descriptions and the color and depth percepts. The vision and language domains are embedded into an intermediate, lower-dimensional space through manifold alignment. The model consists of two simultaneously trained neural nets, one each for vision and language.
Triplet loss ensures that the two spaces are closely aligned in the embedded space by attracting positive associations and repelling negative ones. First, separate models are trained using the University of Washington RGB-D and UMBC GLD datasets to obtain baseline results for grounded language acquisition on domestic objects. Then the baseline model trained on the UW RGB-D data is fine-tuned through a second round of training on UMBC GLD. This fine-tuned model performs better than the model trained only on UMBC GLD, and in less training time. These experiments represent the first steps toward transferring grounded language knowledge from models previously trained on large datasets onto new models running on robots in novel domains.
Item Using Text to Improve Classification of Man-Made Objects (2022-01-01) Vartak, Akash Alok; Oates, Tim; Computer Science and Electrical Engineering; Computer Science
People identify man-made objects by their visual appearance and the text on them, e.g., does a bottle say "water" or "shampoo"? We use text as an important visual cue to help distinguish between similar-looking objects. This thesis explores a novel joint model of visual appearance and textual cues for image classification. We perform this in three steps: (a) isolating an object in an input image; (b) extracting text from the image; and (c) training a joint vision/text model. We simplify the task by extracting the text separately and presenting it to the model in a machine-readable format. Such a joint model has utility in many real-world challenges where language is interpreted through a sensory perception like vision or sound. The aim of the research is to understand whether visual percepts, when understood in the context of extracted language, provide a better classification of image objects than pure vision alone.
In conclusion, we show that joint classifier models can successfully make use of text present in images to classify objects, provided that the text extracted from the images is of high quality and the number of training images is proportional to the number of classification classes.
Item Using Web Images & Natural Language for Object Localization in a Robotics Environment (2020-01-20) Rokisky, Justin Douglass; Matuszek, Cynthia; Computer Science and Electrical Engineering; Computer Science
The ability for humans to interact with robots via language would allow for more natural interactions between robots and humans. To this end, in this work I introduce a novel approach that allows robots to localize objects from an unbounded set of classes given only a description of a target object. The first part of this work is a performance analysis of current state-of-the-art object detectors and a region proposal approach (Uijlings et al., IJCV 2013) on the Autonomous Robot Indoor Dataset (ARID). The second part of this work introduces a three-stage, natural-language-guided webly object localization approach and associated experiments to evaluate its performance. The first stage of the approach generates a webly dataset, without any manual curation, from a human description of the target object. The second stage uses the webly dataset to train a binary classifier for the target object.
Finally, region proposals from selective search (Uijlings et al., IJCV 2013) are input to the webly supervised binary classifier, and the region proposal with the highest confidence score is returned as the prediction.
Item VISUAL COMPUTATIONAL CONTEXT: USING COMPOSITIONS AND NON-TARGET PIXELS FOR NOVEL CLASS DISCOVERY (2019-01-01) Turner, JT; Oates, Tim; Computer Science and Electrical Engineering; Computer Science
During the deep learning revolution in computer science that has occurred since 2006, two factors have pushed our ability to successfully learn from large-scale data sources: exponential growth in computational power, and the size and degree of annotation of our datasets. Modern models loaded on the Graphics Processing Unit (GPU) can fill an entire 12 GB Video Random Access Memory (VRAM) graphics card cache; a training task achievable in weeks would have taken centuries on CPUs from 10 years ago [1]. The standard computer vision dataset at the time, the Mixed National Institute of Standards and Technology (MNIST) dataset, consisted of 70,000 28x28-pixel grayscale images with 10 class labels. The more recent ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset contains over 15 million full-color images with 1,000 different class labels. During this time, however, there has been little growth in the contextual use of images. Context can be used to identify target objects that may be obfuscated in the input space, as well as to confirm or deny the existence of objects based on underlying parts. I use context in two main ways to improve object detection and scene understanding. First, I use location and correlation between objects to infer difficult-to-see and obfuscated objects [2]. In my second study, I further support the necessity of non-target pixels by using background pixels of the image to aid in classification instead of only other objects in the scene.
In addition, I use case-based reasoning to detect novel objects that were not seen during training and to classify them with other visually similar objects based on their observable parts. I use this case-based reasoning model in conjunction with a CNN to demonstrate the ability to overcome the shortcomings of a traditional deep-learned network with case-based reasoning.
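The case-based reasoning fallback described above amounts to retrieving the most similar stored case when the network is unsure. A minimal sketch, with hypothetical names, a scalar feature space, and an illustrative confidence threshold (the actual systems operate on learned deep features):

```python
# Minimal sketch of a case-based reasoning fallback: when the CNN's
# confidence is low, retrieve the most similar stored case and reuse
# its label. Names and the threshold are illustrative.

def cbr_classify(feature, cases, cnn_label, cnn_confidence, threshold=0.9):
    """Trust the CNN's prediction when it is confident; otherwise fall
    back to the nearest stored (feature, label) case."""
    if cnn_confidence >= threshold:
        return cnn_label
    nearest = min(cases, key=lambda case: abs(case[0] - feature))
    return nearest[1]

cases = [(0.2, "cat"), (0.8, "dog"), (0.5, "rabbit")]
# A confident CNN prediction is kept as-is.
print(cbr_classify(0.75, cases, "dog", 0.95))  # dog
# A low-confidence prediction defers to the nearest case.
print(cbr_classify(0.45, cases, "dog", 0.30))  # rabbit
```

The same retrieve-and-reuse pattern extends to part-based similarity: replace the scalar distance with a distance over observable-part features.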