UMBC Center for Accelerated Real Time Analytics

Real time analytics is the leading edge of a smart data revolution, pushed by Internet advances in sensor hardware on one side and AI/ML streaming acceleration on the other. The Center for Accelerated Real Time Analytics (CARTA) explores the realm of streaming applications of Magna Analytics. The center works with next-generation hardware technologies, such as the IBM Minsky with onboard GPU-accelerated processors and Flash RAM, and Smart Cyber Physical Sensor Systems, to build Cognitive Analytics systems and Active storage devices for real time analytics. This will lead to the automated ingestion and simultaneous analytics of Big Datasets generated in various domains, including Cyberspace, Healthcare, the Internet of Things (IoT), and the Scientific arena, and to the creation of self-learning, self-correcting “smart” systems.


Recent Submissions

  • Item
    MASON: A Model for Adapting Service-Oriented Grid Applications
    (Springer, 2003) Li, Gang; Wang, Jianwu; Wang, Jing; Han, Yanbo; Zhao, Zhuofeng; Wagner, Roland M.; Hu, Haitao
    Service-oriented computing, which offers more flexible means for application development, is gaining popularity. Service-oriented grid applications are constructed by selecting and composing appropriate services. They are a promising kind of application in grid environments. However, the dynamism and autonomy of environments make the issues of dynamically adapting a service-oriented grid application urgent. This paper brings forward a model that supports not only monitoring applications through gathering and managing state and structure metadata of service-oriented grid applications, but also dynamic application adjustment by changing the metadata. In addition, the realization and application of the model are presented.
  • Item
    An Approach to Abstracting and Transforming Web Services for End-user-doable Construction of Service-Oriented Applications
    (Gesellschaft für Informatik, 2005) Yu, Jian; Fang, Jun; Han, Yanbo; Wang, Jianwu; Zhang, Cheng
    End-user-programmable business-level services composition is an effective way to build virtual organizations of individual applications in a just-in-time manner. Challenging issues include how to model business-level services so that the end users can understand and compose them, and how to associate business-level services with underlying Web services. This paper presents a service virtualization approach called VINCA Virtualization to support the abstraction, transformation, binding and execution of Web services by end users. Four key mechanisms of VINCA Virtualization, namely semantics annotation, services aggregation, virtualization operation and services convergence, are discussed in detail. VINCA Virtualization has been implemented and its application in a real-world project is illustrated. The paper concludes with a comparative study with other related works.
  • Item
    An Approach to Domain-Specific Reuse in Service-Oriented Environments
    (Springer, 2008) Wang, Jianwu; Yu, Jian; Falcarin, Paolo; Han, Yanbo; Morisio, Maurizio
    Domain engineering is successful in promoting reuse. An approach to domain-specific reuse in service-oriented environments is proposed to facilitate service requesters to reuse Web services. In the approach, we present a conceptual model of domain-specific services (called domain service). Domain services in a certain business domain are modeled by semantic and feature modeling techniques, and bound to Web services with diverse capabilities through a variability-supported matching mechanism. By reusing pre-modeled domain services, service requesters can describe their requests easily through a service customization mechanism. Web service selection based on customized results can also be optimized by reusing the pre-matching results between domain services and Web services. Feasibility of the whole approach is demonstrated on an example.
  • Item
    A High-Level Distributed Execution Framework for Scientific Workflows
    (2008) Wang, Jianwu; Altintas, Ilkay; Berkley, Chad; Gilbert, Lucas; Jones, Matthew B.
  • Item
    Accelerating Parameter Sweep Workflows by Utilizing Ad-hoc Network Computing Resources: an Ecological Example
    Wang, Jianwu; Altintas, Ilkay; Hosseini, Parviez R.; Barseghian, Derik; Crawl, Daniel; Berkley, Chad; Jones, Matthew B.
  • Item
    Facilitating e-Science Discovery Using Scientific Workflows on the Grid
    (Springer, 2011-01-01) Wang, Jianwu; Korambath, Prakashan; Kim, Seonah; Johnson, Scott; Jin, Kejian; Crawl, Daniel; Altintas, Ilkay; Smallen, Shava; Labate, Bill; Houk, Kendall N.
    e-Science has been greatly enhanced by the developing capability and usability of cyberinfrastructure. This chapter explains how scientific workflow systems can facilitate e-Science discovery in Grid environments by providing features including scientific process automation, resource consolidation, parallelism, provenance tracking, fault tolerance, and workflow reuse. We first overview the core services to support e-Science discovery. To demonstrate how these services can be seamlessly assembled, an open source scientific workflow system, called Kepler, is integrated into the University of California Grid. This architecture is being applied to a computational enzyme design process, which is a formidable and collaborative problem in computational chemistry that challenges our knowledge of protein chemistry. Our implementation and experiments validate how the Kepler workflow system can make the scientific computation process automated, pipelined, efficient, extensible, stable, and easy-to-use.
  • Item
    Kepler + Hadoop: a general architecture facilitating data-intensive applications in scientific workflow systems
    (ACM, 2009-11-16) Wang, Jianwu; Crawl, Daniel; Altintas, Ilkay
    MapReduce provides a parallel and scalable programming model for data-intensive business and scientific applications. MapReduce and its de facto open source project, called Hadoop, support parallel processing on large datasets with capabilities including automatic data partitioning and distribution, load balancing, and fault tolerance management. Meanwhile, scientific workflow management systems, e.g., Kepler, Taverna, Triana, and Pegasus, have demonstrated their ability to help domain scientists solve scientific problems by synthesizing different data and computing resources. By integrating Hadoop with Kepler, we provide an easy-to-use architecture that facilitates users to compose and execute MapReduce applications in Kepler scientific workflows. Our implementation demonstrates that many characteristics of scientific workflow management systems, e.g., graphical user interface and component reuse and sharing, are very complementary to those of MapReduce. Using the presented Hadoop components in Kepler, scientists can easily utilize MapReduce in their domain-specific problems and connect them with other tasks in a workflow through the Kepler graphical user interface. We validate the feasibility of our approach via a word count use case.
  • Item
    Provenance for MapReduce-based data-intensive workflows
    (ACM, 2011-11-14) Crawl, Daniel; Wang, Jianwu; Altintas, Ilkay
    MapReduce has been widely adopted by many business and scientific applications for data-intensive processing of large datasets. There are increasing efforts for workflows and systems to work with the MapReduce programming model and the Hadoop environment including our work on a higher-level programming model for MapReduce within the Kepler Scientific Workflow System. However, to date, provenance of MapReduce-based workflows and its effects on workflow execution performance have not been studied in depth. In this paper, we present an extension to our earlier work on MapReduce in Kepler to record the provenance of MapReduce workflows created using the Kepler+Hadoop framework. In particular, we present: (i) a data model that is able to capture provenance inside a MapReduce job as well as the provenance for the workflow that submitted it; (ii) an extension to the Kepler+Hadoop architecture to record provenance using this data model on MySQL Cluster; (iii) a programming interface to query the collected information; and (iv) an evaluation of the scalability of collecting and querying this provenance information using two scenarios with different characteristics.
  • Item
    Challenges and approaches for distributed workflow-driven analysis of large-scale biological data: vision paper
    (ACM, 2012-03-30) Altintas, Ilkay; Wang, Jianwu; Crawl, Daniel; Li, Weizhong
    Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented demands on traditional single-processor bioinformatics algorithms. Middleware and technologies for scientific workflows and data-intensive computing promise new capabilities to enable rapid analysis of next-generation sequence data. Based on this motivation and our previous experiences in bioinformatics and distributed scientific workflows, we are creating a Kepler Scientific Workflow System module, called "bioKepler", that facilitates the development of Kepler workflows for integrated execution of bioinformatics applications in distributed environments. This vision paper discusses the challenges related to next-generation sequencing data, explains the approaches taken in bioKepler to help with analysis of such data, and presents preliminary results demonstrating these approaches.
  • Item
    Approaches to Distributed Execution of Scientific Workflows in Kepler
    (IOS Press, 2013) Płóciennik, Marcin; Żok, Tomasz; Altintas, Ilkay; Wang, Jianwu; Crawl, Daniel; Abramson, David; Imbeaux, Frederic; Guillerminet, Bernard; Lopez-Caniego, Marcos; Plasencia, Isabel Campos; Pych, Wojciech; Ciecieląg, Pawel; Palak, Bartek; Owsiak, Michał; Frauel, Yann; ITM-TF Contributors
    The Kepler scientific workflow system enables creation, execution and sharing of workflows across a broad range of scientific and engineering disciplines while also facilitating remote and distributed execution of workflows. In this paper, we present and compare different approaches to distributed execution of workflows using the Kepler environment, including a distributed data-parallel framework using Hadoop and Stratosphere, and Cloud and Grid execution using Serpens, Nimrod/K and Globus actors. We also present real-life applications in computational chemistry, bioinformatics and computational physics to demonstrate the usage of different distributed computing capabilities of Kepler in executable workflows. We further analyze the differences between the approaches and provide guidance for their application.
  • Item
    Comparison of Distributed Data-Parallelization Patterns for Big Data Analysis: A Bioinformatics Case Study
    (2013) Wang, Jianwu; Crawl, Daniel; Altintas, Ilkay; Tzoumas, Kostas; Markl, Volker
    As a distributed data-parallelization (DDP) pattern, MapReduce has been adopted by many new big data analysis tools to achieve good scalability and performance in Cluster or Cloud environments. This paper explores how two binary DDP patterns, i.e., CoGroup and Match, could also be used in these tools. We reimplemented an existing bioinformatics tool, called CloudBurst, with three different DDP pattern combinations. We identify two factors, namely input data balancing and value sparseness, which could greatly affect performance when using different DDP patterns. Our experiments show: (i) a simple DDP pattern switch could speed up performance by almost two times; (ii) the identified factors can explain the differences well.
  • Item
    Cloud computing in e-Science: research challenges and opportunities
    (Springer, 2014-08-17) Yang, Xiaoyu; Wallom, David; Waddington, Simon; Wang, Jianwu; Shaon, Arif; Matthews, Brian; Wilson, Michael; Guo, Yike; Guo, Li; Blower, Jon D.; Vasilakos, Athanasios V.; Liu, Kecheng; Kershaw, Philip
    Service-oriented architecture (SOA), workflow, the Semantic Web, and Grid computing are key enabling information technologies in the development of increasingly sophisticated e-Science infrastructures and application platforms. While the emergence of Cloud computing as a new computing paradigm has provided new directions and opportunities for e-Science infrastructure development, it also presents some challenges. Scientific research is increasingly finding that it is difficult to handle “big data” using traditional data processing techniques. Such challenges demonstrate the need for a comprehensive analysis on using the above-mentioned informatics techniques to develop appropriate e-Science infrastructure and platforms in the context of Cloud computing. This survey paper describes recent research advances in applying informatics techniques to facilitate scientific research particularly from the Cloud computing perspective. Our particular contributions include identifying associated research challenges and opportunities, presenting lessons learned, and describing our future vision for applying Cloud computing to e-Science. We believe our research findings can help indicate the future trend of e-Science, and can inform funding and research directions in how to more appropriately employ computing technologies in scientific research. We point out the open research issues hoping to spark new development and innovation in the e-Science field.
  • Item
    FlowGate: Towards Extensible and Scalable Web-Based Flow Cytometry Data Analysis
    (ACM, 2015-07-26) Qian, Yu; Kim, Hyunsoo; Purawat, Shweta; Wang, Jianwu; Stanton, Rick; Lee, Alexandra; Xu, Weijia; Altintas, Ilkay; Sinkovits, Robert; Scheuermann, Richard H.
    Recent advances in cytometry instrumentation are enabling the generation of "big data" at the single cell level for the identification of cell-based biomarkers, which will fundamentally change the current paradigm of diagnosis and personalized treatment of immune system disorders, cancers, and blood diseases. However, traditional flow cytometry (FCM) data analysis based on manual gating cannot effectively scale to address this new level of data generation. Computational data analysis methods have recently been developed to cope with the increasing data volume and dimensionality generated from FCM experiments. Making these computational methods easily accessible to clinicians and experimentalists is one of the biggest challenges that algorithm developers and bioinformaticians need to address. This paper describes FlowGate, a novel prototype cyberinfrastructure for web-based FCM data analysis, which integrates graphical user interfaces (GUI), workflow engines, and parallel computing resources for extensible and scalable FCM data analysis. The goal of FlowGate is to allow users to easily access state-of-the-art FCM computational methods developed using different programming languages and software on the same platform, when the implementations of these methods follow standardized I/O. By adopting existing data and information standards, FlowGate can also be integrated as the back-end data analytical platform with existing immunology and FCM databases. Experimental runs of two representative FCM data analytical methods in FlowGate on different cluster computers demonstrated that the task runtime can be reduced linearly with the number of compute cores used in the analysis.
  • Item
    Personalized Active Service Spaces for End-User Service Composition
    (IEEE, 2006-12-11) Han, Jun; Han, Yanbo; Jin, Yan; Wang, Jianwu; Yu, Jian
    End-user service composition is a promising way to ensure flexible, quick and personalized information provision and utilization, and consequently to better cope with spontaneous business requirements. For end-users to compose services directly, issues like service granularity, service organization and business level semantics are critical. End-users will certainly be at a loss if they have to select from a long list of available Web services expressed in IT jargon. This article introduces the concept of personalized active service spaces and focuses on the use of business services, service dependency rules, and service personalization rules to support end-user service composition. It addresses two key issues in end-user composition: how to utilize the user preference and context to restrict the scope of applicable services for selection, and how to capture and utilize dependencies or usage patterns between services in order to provide guidance and enforce temporal/sequential restrictions on service invocations for end-user service compositions.
  • Item
    Big Data Applications Using Workflows for Data Parallel Computing
    (IEEE, 2014-04-16) Wang, Jianwu; Crawl, Daniel; Altintas, Ilkay; Li, Weizhong
    In the Big Data era, workflow systems need to embrace data parallel computing techniques for efficient data analysis and analytics. Here, the authors present an easy-to-use, scalable approach to build and execute Big Data applications using actor-oriented modeling in data parallel computing. They use two bioinformatics use cases for next-generation sequencing data analysis to verify the feasibility of their approach.
  • Item
    A Scalable Data Science Workflow Approach for Big Data Bayesian Network Learning
    (IEEE, 2015-11-09) Wang, Jianwu; Tang, Yan; Nguyen, Mai; Altintas, Ilkay
    In the Big Data era, machine learning has more potential to discover valuable insights from the data. As an important machine learning technique, Bayesian Network (BN) has been widely used to model probabilistic relationships among variables. To deal with the challenges of Big Data BN learning, we apply the techniques in distributed data-parallelism (DDP) and scientific workflow to the BN learning process. We first propose an intelligent Big Data pre-processing approach and a data quality score to measure and ensure the data quality and data faithfulness. Then, a new weight-based ensemble algorithm is proposed to learn a BN structure from an ensemble of local results. To easily integrate the algorithm with DDP engines, such as Hadoop, we employ Kepler scientific workflow to build the whole learning process. We demonstrate how Kepler can facilitate building and running our Big Data BN learning application. Our experiments show good scalability and learning accuracy when running the application in real distributed environments.
  • Item
    Big data provenance: Challenges, state of the art and opportunities
    (IEEE, 2015-12-28) Wang, Jianwu; Crawl, Daniel; Purawat, Shweta; Nguyen, Mai; Altintas, Ilkay
    The ability to track provenance is a key feature of scientific workflows to support data lineage and reproducibility. The challenges that are introduced by the volume, variety and velocity of Big Data also pose related challenges for provenance and quality of Big Data, defined as veracity. The increasing size and variety of distributed Big Data provenance information bring new technical challenges and opportunities throughout the provenance lifecycle including recording, querying, sharing and utilization. This paper discusses the challenges and opportunities of Big Data provenance related to the veracity of the datasets themselves and the provenance of the analytical processes that analyze these datasets. It also explains our current efforts towards tracking and utilizing Big Data provenance using workflows as a programming model to analyze Big Data.
  • Item
    Machine learning on big data: Opportunities and challenges
    (Elsevier, 2017-01-12) Zhou, Lina; Pan, Shimei; Wang, Jianwu; Vasilakos, Athanasios V.
    Machine learning (ML) is continuously unleashing its power in a wide range of applications. It has been pushed to the forefront in recent years partly owing to the advent of big data. ML algorithms have never been better promised while challenged by big data. Big data enables ML algorithms to uncover more fine-grained patterns and make more timely and accurate predictions than ever before; on the other hand, it presents major challenges to ML such as model scalability and distributed computing. In this paper, we introduce a framework of ML on big data (MLBiD) to guide the discussion of its opportunities and challenges. The framework is centered on ML which follows the phases of preprocessing, learning, and evaluation. In addition, the framework is also comprised of four other components, namely big data, user, domain, and system. The phases of ML and the components of MLBiD provide directions for identification of associated opportunities and challenges and open up future work in many unexplored or under explored research areas.
  • Item
    A Hybrid Learning Framework for Imbalanced Stream Classification
    (IEEE, 2017-09-11) Zhang, Wenbin; Wang, Jianwu
    The pervasive imbalanced class distribution occurring in real-world stream applications, such as surveillance, security and finance, in which data arrive continuously, has sparked extensive interest in the study of imbalanced stream classification. In such applications, the evolution of unstable class concepts is always accompanied and complicated by the skewed class distribution. However, most of the existing methods focus on either the class imbalance problem or the non-stationary learning problem; the combined approach of addressing both issues has received relatively little research attention. In this paper, we propose a hybrid framework for imbalanced stream learning that consists of three components: classifier updating, resampling and cost sensitive classifier. Based on the framework, we propose a hybrid learning algorithm to combine data-level and algorithm-level methods as well as classifier retraining mechanics to tackle class imbalance in data streams. Our experiments using real-world datasets and synthetic datasets show that our proposed hybrid learning algorithm can have better effectiveness and efficiency.