All publications sorted by year |
2025 |
Ensuring timely coordination between autonomous aircraft is a challenging problem in decentralized air traffic management (ATM) applications for urban air mobility (UAM) scenarios. This paper presents an approach for formally guaranteeing timely progress in a Two-Phase Acknowledge distributed knowledge propagation protocol by probabilistically modeling the delays using the theory of the Multicopy Two-Hop Relay protocol and the M/M/1 queue system. The guarantee states a probabilistic upper bound on the time for progress as a function of the probabilities of the total transmission and processing delays following two specific distributions. The proof uses a general library of formal theories that can be used for the rigorous mechanical verification of autonomous aircraft coordination protocols using the Athena proof checker and assistant. |
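As a rough illustration of the structure of such a bound (not the paper's Athena formalization), the sketch below assumes an exponentially distributed transmission delay and an M/M/1 sojourn time for processing, and multiplies the two probabilities to lower-bound the probability of progress within a given time budget; all parameter names and numbers are assumptions.

```python
import math

def p_exponential_below(t, rate):
    # CDF of an exponential delay: P(D <= t) = 1 - exp(-rate * t)
    return 1.0 - math.exp(-rate * t)

def p_mm1_sojourn_below(t, arrival_rate, service_rate):
    # M/M/1 sojourn (queueing + service) time CDF: 1 - exp(-(mu - lambda) * t),
    # valid when service_rate > arrival_rate (stable queue).
    assert service_rate > arrival_rate, "queue must be stable"
    return 1.0 - math.exp(-(service_rate - arrival_rate) * t)

def progress_probability_bound(t_trans, t_proc, trans_rate, arrival_rate, service_rate):
    # If progress needs total delay <= t_trans + t_proc and the two delays are
    # independent, then P(progress by t_trans + t_proc) is at least the product below.
    p1 = p_exponential_below(t_trans, trans_rate)
    p2 = p_mm1_sojourn_below(t_proc, arrival_rate, service_rate)
    return p1 * p2

# Example: transmission delay with mean 0.5 s, processing node with lambda = 2/s, mu = 10/s.
print(progress_probability_bound(t_trans=2.0, t_proc=1.0,
                                 trans_rate=2.0, arrival_rate=2.0, service_rate=10.0))
```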
2024 |
This work investigates the application of the Cramér-Rao Lower Bound (CRLB) theorem, within the framework of Dynamic Data Driven Applications Systems (DDDAS), in view of the formal verification of state estimates via stochastic Vector-dependent Functionally Pooled Auto-Regressive (VFP-AR) models. The VFP-AR model is identified via data obtained from wind tunnel experiments on a "fly-by-feel" wing structure under multiple flight states (i.e., angle of attack, velocity). The VFP-based CRLB of the state estimates is derived for each true flight state, reflecting the state estimation capability of the model under the data, model, and estimation assumptions. Apart from the CRLB obtained from pristine data and models, CRLBs are estimated using artificially corrupted testing data, sub-optimal models, or both. Comparisons are made between CRLBs and state estimates from corrupted and pristine conditions. The verification of the obtained state estimates rests on a mechanically verified formal proof of the CRLB theorem in Athena, which provides an irrefutable guarantee of soundness as long as the specified assumptions are followed. The results of the study indicate the potential of using a CRLB-based formal verification framework for state estimation via stochastic FP time series models. |
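For reference, the classical form of the CRLB on which such verification rests (standard textbook statement; the paper derives a VFP-AR-specific instance in Athena):

```latex
% Standard (vector-parameter) Cramér-Rao Lower Bound: for any unbiased estimator
% \hat{\theta} of \theta, the covariance is bounded below by the inverse Fisher information.
\[
  \operatorname{Cov}\bigl(\hat{\theta}\bigr) \;\succeq\; \mathcal{I}(\theta)^{-1},
  \qquad
  \mathcal{I}(\theta) \;=\;
  \mathbb{E}\!\left[
    \left(\frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta}\right)
    \left(\frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta}\right)^{\!\top}
  \right].
\]
```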
2023 |
We present a novel approach to formally verify data-driven models for state classification in the domain of aerospace. Dynamic data-driven application systems (DDDAS) extend first-principles models with dynamically sensed data, enabling the diagnosis and healing of aircraft during flight. The intrinsic complexity of these systems makes them prone to spurious errors caused by unseen conditions. Formal (software) verification is a technique that goes beyond unit testing or statistical model checking, which can only guarantee system correctness over a fraction of the system's input space. Safety envelopes bound regions of the system's input space where a formal proof of correctness of state classification holds. We focus on two questions for defining safety envelope boundaries: (i) does the data follow the model? and (ii) what is the most likely state of the system given the data? We evaluate safety envelopes with data derived from a wind tunnel experiment tackling the problem of stall detection given piezo-electric sensor measurements over a wing's skin. We define tailored metrics to show the quality of different data-driven models. We encode safety envelopes in the proof assistant Agda, and illustrate their applicability across a variety of input dimensions and Gaussian Process Regression Model (GPRM)-generated data. |
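The two boundary questions can be pictured with a small Gaussian-process sketch (illustrative only; the scikit-learn model, synthetic data, and thresholds below are assumptions, not the Agda formalization or the wind tunnel dataset).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Training data: flight states (angle of attack [deg], airspeed [m/s]) -> signal energy.
X_train = np.array([[2, 10], [4, 12], [6, 14], [10, 10], [14, 12], [18, 14]], float)
y_train = np.array([0.8, 1.0, 1.3, 2.4, 3.9, 5.2])  # synthetic energies

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=5.0) + WhiteKernel(0.05),
                               normalize_y=True).fit(X_train, y_train)

def follows_model(state, energy, k=3.0):
    # (i) In-envelope check: the measurement lies within k predictive std-devs of the model.
    mean, std = gpr.predict(np.atleast_2d(state), return_std=True)
    return abs(energy - mean[0]) <= k * std[0]

def most_likely_state(energy, candidates):
    # (ii) Pick the candidate state whose predictive distribution best explains the data.
    def neg_log_lik(s):
        mean, std = gpr.predict(np.atleast_2d(s), return_std=True)
        return 0.5 * ((energy - mean[0]) / std[0]) ** 2 + np.log(std[0])
    return min(candidates, key=neg_log_lik)

print(follows_model([6, 14], 1.25))
print(most_likely_state(3.8, candidates=[[4, 12], [14, 12], [18, 14]]))
```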
Developments in autonomous aircraft, such as electric vertical take-off and landing vehicles and multicopter drones, raise safety-critical concerns in populated areas. This article presents the Analysis of Safety-Critical Systems Using Formal Methods-Based Runtime Evaluation (ASSURE) framework, which is a collection of techniques for aiding in the formal verification of safety-critical aerospace systems. ASSURE supports the rigorous verification of deterministic and nondeterministic properties of both distributed and centralized aerospace applications by using formal theorem proving tools. We present verifiable algorithms and software, formal reasoning models, formal proof libraries, and a data-driven runtime verification approach for aerospace systems toward a provably safe Internet of Planes infrastructure. |
We introduce a framework using session types for denoting the relationships between multiple actors in a system. We prove that ascribing to this framework guarantees deadlock freedom, and demonstrate tests of how an implementation of such a framework would work in the SALSA language. |
2022 |
As autonomous vehicular technologies such as self-driving cars and uncrewed aircraft systems (UAS) evolve to become more accessible and cost-efficient, autonomous multi-agent systems comprising such entities will become ubiquitous in the near future. The close operational proximity between such autonomous agents will necessitate multi-agent coordination to ensure safe operations. In this thesis, we adopt a formal methods-based approach to investigate multi-agent coordination for safety-critical autonomous multi-agent systems. We explore algorithms that can be used for decentralized multi-agent coordination among autonomous mobile agents by communicating over asynchronous vehicle-to-vehicle (V2V) networks that can be prone to agent failures. In particular, we study two types of distributed algorithms that are useful for decentralized coordination — consensus, which can be used by autonomous agents to agree on a set of compatible operations; and knowledge propagation, which can be used to ensure sufficient situational awareness in autonomous multi-agent systems. We develop the first machine-checked proof of eventual progress for the Synod consensus algorithm, which does not assume a unique leader. To consider agent failures while reasoning about progress, we introduce a novel Failure-Aware Actor Model (FAM). We then propose a formally verified Two-Phase Acknowledge Protocol (TAP) for knowledge propagation that can establish a safe state of knowledge suitable for autonomous vehicular operations. The non-deterministic and dynamic operating conditions of distributed algorithms deployed over asynchronous V2V networks make it challenging to provide appropriate formal guarantees for the algorithms. To address this, we introduce probabilistic correctness properties that can be developed by stochastically modeling the systems. We present a formal proof library that can be used for reasoning about probabilistic properties of distributed algorithms deployed over V2V networks. We also propose a Dynamic Data-Driven Applications Systems (DDDAS)-based approach for the runtime verification of distributed algorithms. This approach uses parameterized proofs, which can be instantiated at runtime, and progress envelopes, which can divide the operational state space into distinct regions where a proof of progress may or may not hold. To motivate our verification of decentralized coordination, we introduce an autonomous air traffic management (ATM) technique for multi-aircraft systems called Decentralized Admission Control (DAC). |
Successfully attaining consensus in the absence of a centralized coordinator is a fundamental problem in distributed multi-agent systems. We analyze progress in the Synod consensus protocol — which does not assume a unique leader — under the assumptions of asynchronous communication and potential agent failures. We identify a set of sufficient conditions under which it is possible to guarantee that a set of agents will eventually attain consensus using Synod. First, a subset of the agents must not permanently fail or exhibit Byzantine failure until consensus is reached, and second, at least one proposal must be eventually uninterrupted by higher-numbered proposals. To formally reason about agent failures, we introduce a failure-aware actor model (FAM). Using FAM, we model the identified conditions and provide a formal proof of eventual progress in Synod. Our proof has been mechanically verified using the Athena proof assistant and, to the best of our knowledge, it is the first machine-checked proof of eventual progress in Synod. |
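A toy Monte Carlo sketch can make the role of the two sufficient conditions concrete (this is an illustration only, not the FAM model or the Athena proof; the probabilities and round structure are invented for the example).

```python
import random

def run_round(n_acceptors=5, p_alive=0.8, p_interrupt=0.3, rng=random.Random(1)):
    # A ballot makes progress only if (a) a quorum of acceptors is alive and
    # non-Byzantine and (b) no higher-numbered ballot interrupts it.
    alive = [rng.random() < p_alive for _ in range(n_acceptors)]
    quorum_ok = sum(alive) > n_acceptors // 2          # condition (a)
    uninterrupted = rng.random() >= p_interrupt        # condition (b)
    return quorum_ok and uninterrupted

def rounds_until_consensus(max_rounds=10_000, **kw):
    rng = random.Random(42)
    for r in range(1, max_rounds + 1):
        if run_round(rng=rng, **kw):
            return r
    return None  # no progress observed within the horizon

print(rounds_until_consensus())                  # both conditions can hold -> progress
print(rounds_until_consensus(p_interrupt=1.0))   # every ballot interrupted -> no progress
```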
Spatio-temporal data streams are often related in complex ways; for example, while the airspeed that an aircraft attains in cruise phase depends on the weight it carries, it also depends on many other factors. Some of these factors are controllable, such as engine inputs or the airframe's angle of attack, while others are contextual, such as air density or turbulence. It is therefore critical to develop failure models that can help recognize errors in the data, such as an incorrect fuel quantity, a malfunctioning pitot-static system, or other abnormal flight conditions. In this paper, we extend our PILOTS programming language [1] to support machine learning techniques that will help data scientists: (1) create parameterized failure models from data and (2) continuously train a statistical model as new evidence (data) arrives. The linear regression approach learns parameters of a linear model to minimize the least squares error for given training data. The Bayesian approach classifies operating modes according to supervised offline training and can discover new statistically significant modes online. As the Tuninter 1153 simulation results show, the dynamic Bayes classifier finds discrete error states on the fly, while the error-signatures approach requires every error state to be predefined. Using synthetic data, we compare the accuracy, response time, and adaptability of these machine learning techniques. Future dynamic data driven applications systems (DDDAS) using machine learning can identify complex dynamic data-driven failure models, which will in turn enable more accurate flight planning and control for emergency conditions. |
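The two learning modes described above can be sketched in a few lines (illustrative Python, not PILOTS syntax; the features, data, and failure modes listed are assumptions for the example).

```python
import numpy as np

# (1) Linear model: predict airspeed from weight and angle of attack (synthetic data).
X = np.array([[1000, 2], [1100, 3], [1200, 2.5], [1300, 4]], float)
y = np.array([120.0, 118.0, 116.5, 113.0])
A = np.hstack([X, np.ones((len(X), 1))])          # add intercept column
theta, *_ = np.linalg.lstsq(A, y, rcond=None)     # minimizes least-squares error

def predict(x):
    return np.dot(np.append(x, 1.0), theta)

# (2) Gaussian Bayes classification of operating modes from a residual signal.
modes = {"normal": (0.0, 1.0), "pitot_fault": (8.0, 2.0)}  # assumed (mean, std) per mode

def classify(residual, new_mode_z=4.0):
    def log_lik(mu, sd):
        return -0.5 * ((residual - mu) / sd) ** 2 - np.log(sd)
    best = max(modes, key=lambda m: log_lik(*modes[m]))
    mu, sd = modes[best]
    # If even the best-known mode explains the residual poorly, report a new mode.
    return best if abs(residual - mu) <= new_mode_z * sd else "new_mode"

print(predict([1150, 3.0]), classify(0.7), classify(25.0))
```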
This chapter attempts to give an answer to the following question: Given an obligation and a set of potentially-inconsistent, ethically-charged beliefs, how can an artificially-intelligent agent ensure that its actions maximize the likelihood that the obligation is satisfied? Our approach to answering this question is in the intersection of several areas of research, including automated planning, reasoning with uncertainty, and argumentation. We exemplify our reasoning framework in a case study based on the famous, heroic ditching of US Airways Flight 1549, an event colloquially known as the "Miracle on the Hudson." |
Unmanned aerial vehicles (UAVs) are becoming a viable platform for sensing and estimation in a wide variety of applications including disaster response, search and rescue, and security monitoring. These sensing UAVs have limited battery and computational capabilities, and thus must offload their data so it can be processed to provide actionable intelligence. We consider a compute platform consisting of a limited number of highly-resourced UAVs that act as mobile edge computing (MEC) servers to process the workload on premises. We propose a novel distributed solution to the collaborative processing problem that adaptively positions the MEC UAVs in response to the changing workload that arises both from the sensing UAVs’ mobility and the task generation. Our solution consists of two key building blocks: (1) an efficient workload estimation process by which the UAVs estimate the task field—a continuous approximation of the number of tasks to be processed at each location in the airspace, and (2) a distributed optimization method by which the UAVs partition the task field so as to maximize the system throughput. We evaluate our proposed solution using realistic models of surveillance UAV mobility and show that our method achieves up to 28% improvement in throughput over a non-adaptive baseline approach. |
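A simplified sketch of the two building blocks (a kernel-based task-field estimate and a Lloyd-style repositioning step) is given below; it is an illustration under assumed grid, bandwidth, and placement rules, not the paper's distributed optimization method.

```python
import numpy as np

def task_field(task_xy, grid, bandwidth=1.0):
    # Kernel-density style estimate of tasks per location on a grid of points.
    d2 = ((grid[:, None, :] - task_xy[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).sum(1)

def reposition(servers, grid, field):
    # Assign each grid cell to its nearest MEC server, then move every server
    # to the field-weighted centroid of its cell (one Lloyd-style update).
    owner = np.argmin(((grid[:, None, :] - servers[None, :, :]) ** 2).sum(-1), axis=1)
    new = servers.copy()
    for k in range(len(servers)):
        w = field[owner == k]
        if w.sum() > 0:
            new[k] = (grid[owner == k] * w[:, None]).sum(0) / w.sum()
    return new

xs, ys = np.meshgrid(np.linspace(0, 10, 30), np.linspace(0, 10, 30))
grid = np.c_[xs.ravel(), ys.ravel()]
tasks = np.array([[2.0, 2.0], [2.5, 2.5], [8.0, 7.5], [8.5, 8.0], [8.2, 7.8]])
servers = np.array([[5.0, 5.0], [5.5, 5.0]])
field = task_field(tasks, grid)
for _ in range(10):
    servers = reposition(servers, grid, field)
print(servers.round(2))   # servers drift toward the two task clusters
```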
2021 |
With the recent proliferation of big data analytics (BDA), enterprises collect and transform data to perform predictive analyses on a scale that a few years ago was not possible. BDA methodologies involve business, analytics, and technology domains. Each domain deals with different concerns at different abstraction levels, but current BDA development does not consider the formal integration among these domains. Hence, the deployment procedure usually implies rewriting code to be deployed on specific IT infrastructures to obtain software aligned to functional and nonfunctional requirements. Moreover, surveys have reported a high-cost and error-prone transition between analytics development (data lab) and production environments. This thesis explores the challenges faced by stakeholders in BDA application development and presents a domain-specific model (DSM) approach to design, validate, and generate BDA applications from an architectural perspective, bridging the gap between the analytics and IT architecture domains. First, we report the results of a survey on BDA application deployment, administered to BDA practitioners to identify current practices and challenges. Second, we present the ACCORDANT reference architecture, with a catalog of tactics and patterns, to facilitate BDA adoption. Next, we present the ACCORDANT modeling framework, a DSM to design and deploy BDA applications via the specification of architectural inputs and functional and deployment views. Then, we state an approach to specify and evaluate constraints using object constraints and semantic reasoning in BDA architectures conforming to ACCORDANT. Finally, we report the results of applying this proposal in case studies and a survey that compares our approach with others according to 34 respondents. |
Big data analytics (BDA) applications use machine learning algorithms to extract valuable insights from large, fast, and heterogeneous data sources. New software engineering challenges for BDA applications include ensuring performance levels of data-driven algorithms even in the presence of large data volume, velocity, and variety (3Vs). BDA software complexity frequently leads to delayed deployments, longer development cycles, and challenging performance assessment. This paper proposes a Domain-Specific Model (DSM) and DevOps practices to design, deploy, and monitor performance metrics in BDA applications. Our proposal includes a design process and a framework to define architectural inputs, software components, and deployment strategies through integrated high-level abstractions that enable quality scenario (QS) monitoring. We evaluate our approach with four use cases from different domains to demonstrate a high level of generalization. Our results show shorter deployment and monitoring times, and a higher gain factor per iteration, compared to similar approaches. |
This chapter attempts to give an answer to the following question: Given an obligation and a set of potentially-inconsistent, ethically-charged beliefs, how can an artificially-intelligent agent ensure that its actions maximize the likelihood that the obligation is satisfied? Our approach to answering this question is in the intersection of several areas of research, including automated planning, reasoning with uncertainty, and argumentation. We exemplify our reasoning framework in a case study based on the famous, heroic ditching of US Airways Flight 1549, an event colloquially known as the "Miracle on the Hudson." |
The ever-increasing complexity of cyber-physical systems is driving the need for assurance of critical infrastructure and embedded systems. However, traditional methods to secure cyber-physical systems—e.g., using cyber best practices, adapting mechanisms from information technology systems, and penetration testing followed by patching—are becoming ineffective. This paper describes, in detail, Verification Evidence and Resilient Design In anticipation of Cybersecurity Threats (VERDICT), a language and framework to address cyber resiliency. When we use the term resiliency, we mean hardening a system such that it anticipates and withstands attacks. VERDICT analyzes a system in the face of cyber threats and recommends design improvements that can be applied early in the system engineering process. This is done in two steps: (1) analyzing at the system architectural level, with respect to cyber and safety requirements, and (2) analyzing at the component behavioral level, with respect to a set of cyber-resiliency properties. The framework consists of three parts: (1) Model-Based Architectural Analysis and Synthesis (MBAAS); (2) Assurance Case Fragments Generation (ACFG); and (3) Cyber Resiliency Verifier (CRV). The VERDICT language is an Architecture Analysis and Design Language (AADL) annex for modeling the safety and security aspects of a system's architecture. MBAAS performs probabilistic analyses, suggests defenses to mitigate attacks, and generates attack-defense trees and fault trees as evidence of resiliency and safety. It can also synthesize optimal defense solutions with respect to implementation costs. In addition, ACFG assembles MBAAS evidence into goal structuring notation for certification purposes. CRV analyzes behavioral aspects of the system (i.e., the design model), modeled using the Assume-Guarantee Reasoning Environment (AGREE) annex and checked against cyber resiliency properties using the Kind 2 model checker. When a property is proved or disproved, a minimal set of vital system components responsible for the proof/disproof is identified. CRV also provides rich and localized diagnostics so the user can quickly identify problems and fix the design model. This paper describes the VERDICT language and each part of the framework in detail and includes a case study to demonstrate the effectiveness of VERDICT—in this case, a delivery drone. |
Applications for data-driven systems are expected to be correct implementations of the system specifications, but developers usually test against only a few indicative scenarios to verify them. In the absence of exhaustive testing, errors may occur in real-time scenarios, especially when dealing with large data streams from moving objects such as multicopters, vehicles, etc. Model checking techniques also lack scalability and completeness. We present a novel approach based on existing tools that enables a developer to write high-level code directly as system specifications and simultaneously prove the correctness of the generated code. We present a fault detection and identification (FDI) software development approach using the declarative programming language PILOTS. The grammar of PILOTS has been updated to provide easier syntax for threshold validation techniques. The failure detection model is described as high-level specifications that the generated code has to adhere to. The complete FDI problem is formally specified using Hoare logic and proven correct using the automated proof assistant Dafny. A case study of rotor failures in a hexacopter is used to illustrate the approach and visualize the results. |
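The contract style described above can be approximated in plain Python with runtime pre- and postconditions (a hedged sketch: the rotor-RPM threshold rule and numbers are assumptions, and the real PILOTS/Dafny artifacts express the contract statically rather than with asserts).

```python
def detect_failed_rotors(rpm_commanded, rpm_measured, threshold=0.15):
    # Precondition: equal-length inputs with positive commanded RPMs.
    assert len(rpm_commanded) == len(rpm_measured) and all(c > 0 for c in rpm_commanded)
    failed = [i for i, (c, m) in enumerate(zip(rpm_commanded, rpm_measured))
              if abs(c - m) / c > threshold]
    # Postcondition: every reported rotor really violates the threshold, and no
    # rotor within the threshold is reported.
    assert all(abs(rpm_commanded[i] - rpm_measured[i]) / rpm_commanded[i] > threshold
               for i in failed)
    assert all(i in failed or
               abs(rpm_commanded[i] - rpm_measured[i]) / rpm_commanded[i] <= threshold
               for i in range(len(rpm_commanded)))
    return failed

# Hexacopter example: rotor 3 reads far below its commanded speed.
print(detect_failed_rotors([5000]*6, [4980, 5010, 4995, 3900, 5020, 4970]))
```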
This paper presents an approach and tools for automatic generation of security assurance case fragments using patterns for arguing the security of cyber physical systems. The fragments are generated using augmented Goal Structuring Notation (GSN) and can succinctly convey a system's resilience to cyber-threats specified in MITRE's Common Attack Pattern Enumeration and Classification (CAPEC). The GSN schema has been augmented with additional metadata that can be used for visually tracing back to component-level CAPEC threats from higher-level cyber security claims, enabling designers to easily locate flaws in a model when one or more claims cannot be substantiated. An implementation of the approach as a part of the Verification Evidence and Resilient Design in Anticipation of Cybersecurity Threats (VERDICT) toolchain has also been demonstrated along with a case study of a package delivery drone. |
Successfully attaining consensus in the absence of a centralized coordinator is a fundamental problem in distributed multi-agent systems. We analyze progress in the Synod consensus protocol—which does not assume a unique leader—under the assumptions of asynchronous communication and potential agent failures. We identify a set of sufficient conditions under which it is possible to guarantee that a set of agents will eventually attain consensus. First, a subset of the agents must behave correctly and not permanently fail until consensus is reached, and second, at least one proposal must be eventually uninterrupted by higher-numbered proposals. To formally reason about agent failures, we introduce a failure-aware actor model (FAM). Using FAM, we model the identified conditions and provide a formal proof of eventual progress in Synod. Our proof has been mechanically verified using the Athena proof assistant and, to the best of our knowledge, it is the first machine-checked proof of eventual progress in Synod. |
Autonomous air traffic management (ATM) operations for urban air mobility (UAM) will necessitate the use of distributed protocols for decentralized coordination between aircraft. As UAM operations are time-critical, it will be imperative to have formal guarantees of progress for the distributed protocols used in ATM. Under asynchronous settings, message transmission and processing delays are unbounded, making it impossible to provide deterministic bounds on the time required to make progress. We present an approach for formally guaranteeing timely progress in a Two-Phase Acknowledge distributed knowledge propagation protocol by probabilistically modeling the delays using theories of the Multicopy Two-Hop Relay protocol and the M/M/1 queue system. The guarantee states a probabilistic upper bound on the time for progress as a function of the probabilities of the total transmission and processing delays being less than two given values. We also showcase the development of a library of formal theories, tailored towards reasoning about timely progress in distributed protocols deployed in airborne networks, in the Athena proof assistant. |
Operational Risk Assessment (ORA) is a process used to demonstrate and verify that the resultant risk of a planned operation meets certain required safety standards. Subject matter experts (SME) from different domains often use different approaches and terminologies to design ORAs. This leads to long review cycles and creates potential for inconsistent understanding of risks and/or application of mitigations by different practitioners across the safety-risk-chain. In this paper, to formalize ORA data representation, we propose a set of common terminologies to be used while capturing ORA data. The proposed terminologies trace to existing standards and to terminologies used in risk data visualization methodologies. We also present a formal data model for ORA, which uses the proposed terminologies, in SADL (Semantic Application Design Language), thereby allowing SMEs to capture their knowledge as formal artifacts that are amenable to machine manipulation and automation. Furthermore, since ORA data is often captured in an Excel format, we illustrate the use of an Excel template that uses the proposed terminologies, by capturing assessment data corresponding to an example use case scenario in the template. Finally, to enable visualization of the ORA data, we discuss representing them as Bowtie diagrams. A Bowtie diagram is a pictorial representation that captures the relationship between a hazard, its causes, and its consequences in a given specific environment or system state. To enable the benefits of Bowtie representation, we map the proposed ORA terminologies to elements in a Bowtie model. We illustrate this visualization by generating a Bowtie diagram that captures the ORA data corresponding to the example use case scenario considered in the paper. |
Successfully attaining consensus in the absence of a centralized coordinator is a fundamental problem in distributed multi-agent systems. We analyze progress in the Synod consensus protocol—which does not assume a unique leader—under the assumptions of asynchronous communication and potential agent failures. We identify a set of sufficient conditions under which it is possible to guarantee that a set of agents will eventually attain consensus. First, a subset of the agents must behave correctly and not permanently fail until consensus is reached, and second, at least one proposal must be eventually uninterrupted by higher-numbered proposals. To formally reason about agent failures, we introduce a failure-aware actor model (FAM). Using FAM, we model the identified conditions and provide a formal proof of eventual progress in Synod. Our proof has been mechanically verified using the Athena proof assistant and, to the best of our knowledge, it is the first machine-checked proof of eventual progress in Synod. |
2020 |
Intelligent aerospace systems of the future are expected to be "smarter" and more self-sufficient in terms of self-diagnosis, self-healing, overall situational awareness, and safe navigation. This will be facilitated by access to a vast amount of real-time data from on-board sensors, other aircraft, ground stations, and satellites. Dynamic data-driven application systems (DDDAS) can be used to incorporate real-time data for creating high-fidelity models to aid in flight diagnosis and decision-making. DDDAS techniques accommodate the fusion of dynamic data, algorithms, computation, and interpretation, making them apposite for use in safety-critical cyber-physical systems. In safety-critical systems, it is important to have irrefutable system guarantees. Formal methods allow the development of machine-checked correctness proofs for such guarantees, facilitating the verification of such systems over infinite state spaces. This chapter presents formal correctness envelopes, analogous to the performance envelopes of an aircraft, which represent the operating conditions under which guarantees regarding a system's properties hold. These correctness envelopes are data-driven, allowing them to be computed, quantified, and monitored in real time. Examples of safety envelopes and progress envelopes, which represent subsets of the DDDAS state space where safety and progress guarantees of a system are valid, respectively, are discussed. Correctness sentinels are executable programs that use the notion of correctness envelopes to monitor real-time data streams and detect the status of a system's state with respect to relevant envelopes during runtime. At any given point in time, correctness sentinels can provide useful information about which system properties can be guaranteed and with how much confidence. |
This work presents the investigation and critical assessment, within the framework of Dynamic Data Driven Applications Systems (DDDAS), of two probabilistic state awareness approaches for fly-by-feel aerial vehicles based on (i) stochastic adaptive time-dependent time series models and (ii) Bayesian learning via homoscedastic and heteroscedastic Gaussian process regression models (GPRMs). Stochastic time-dependent autoregressive (TAR) time series models with adaptive parameters are estimated via a recursive maximum likelihood (RML) scheme and used to represent the dynamic response of a self-sensing composite wing under varying flight states. Bayesian learning based on homoscedastic and heteroscedastic versions of GPRM is assessed via the ability to represent the nonlinear mapping between the flight state and the vibration signal energy of the wing. The experimental assessment is based on a prototype self-sensing UAV wing that is subjected to a series of wind tunnel experiments under multiple flight states. |
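As a flavor of the adaptive-parameter idea, the sketch below runs a standard recursive least squares estimator with a forgetting factor on a synthetic signal with drifting AR(2) dynamics; the model order, forgetting factor, and data are assumptions, not the paper's RML estimator or wind tunnel signals.

```python
import numpy as np

def rls_ar(y, order=2, lam=0.98, delta=100.0):
    # Recursive least squares with forgetting factor lam for time-varying AR parameters.
    theta = np.zeros(order)
    P = delta * np.eye(order)
    history = []
    for t in range(order, len(y)):
        x = y[t - order:t][::-1]                 # [y_{t-1}, ..., y_{t-order}]
        err = y[t] - x @ theta
        K = P @ x / (lam + x @ P @ x)
        theta = theta + K * err
        P = (P - np.outer(K, x) @ P) / lam
        history.append(theta.copy())
    return np.array(history)

# Synthetic signal whose AR(2) dynamics drift slowly over time.
rng = np.random.default_rng(0)
y = np.zeros(600)
for t in range(2, 600):
    a1 = 1.2 + 0.2 * (t / 600)                   # slowly varying coefficient
    y[t] = a1 * y[t-1] - 0.5 * y[t-2] + 0.1 * rng.standard_normal()
print(rls_ar(y)[-1])                             # tracked parameters near [1.4, -0.5]
```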
Big data analytics (BDA) applications use machine learning to extract valuable insights from large, fast, and heterogeneous data sources. The architectural design and evaluation of BDA applications entail new challenges to integrate emerging machine learning algorithms with cutting-edge practices whilst ensuring performance levels even in the presence of large data volume, velocity, and variety (3Vs). This paper presents a design process approach based on the Attribute-Driven Design (ADD) method and the Architecture Tradeoff Analysis Method (ATAM) to specify, deploy, and monitor performance metrics in BDA applications supported by domain-specific modeling and DevOps. Our design process starts with the definition of architectural drivers, followed by functional and deployment specification through integrated high-level modeling, which enables quality scenario monitoring. We used two use cases from avionics to evaluate this proposal, and the preliminary results suggest advantages in integrating multiple views and automating deployment and monitoring, compared to similar approaches. |
Safety envelopes are meant to determine under which conditions and state space regions a probabilistic property of a data-driven system can be asserted with high confidence. Dynamic data-driven applications systems (DDDAS) can make use of safety envelopes to be cognizant of the formal guarantees derived from their models and assumptions. An example of safety envelopes is presented as the intersection of two simpler concepts: z-predictability and τ-confidence, which correspond to state estimation and classification, respectively. To illustrate safety envelopes, stall detection from signal energy is shown with data gathered by piezo-electric sensors on a composite wing inside a wind tunnel under varying angle-of-attack and airspeed configurations. A formalization of these safety envelopes is presented in the Agda proof assistant, from which formally proven sentinel code can be generated. |
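A hedged sketch of the envelope intersection follows (the names z and τ are taken from the abstract, but the concrete definitions below are simplified stand-ins for the Agda formalization).

```python
def z_predictable(energy, mean, std, z):
    # State estimation side: the measured signal energy lies within z std-devs
    # of the model's prediction for the assumed flight state.
    return abs(energy - mean) <= z * std

def tau_confident(p_stall, tau):
    # Classification side: the classifier's posterior for the predicted class
    # exceeds the confidence threshold tau.
    return max(p_stall, 1.0 - p_stall) >= tau

def in_safety_envelope(energy, mean, std, p_stall, z=2.0, tau=0.95):
    # The safety envelope is the intersection of the two simpler conditions.
    return z_predictable(energy, mean, std, z) and tau_confident(p_stall, tau)

print(in_safety_envelope(energy=3.1, mean=3.0, std=0.2, p_stall=0.98))   # True
print(in_safety_envelope(energy=4.5, mean=3.0, std=0.2, p_stall=0.98))   # False
```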
We present a framework for scheduling multifunction serverless applications over a hybrid public-private cloud. A set of serverless jobs is input as a batch, and the objective is to schedule function executions over the hybrid platform to minimize the cost of public cloud use, while completing all jobs by a specified deadline. As this scheduling problem is NP-hard, we propose a greedy algorithm that dynamically determines both the order and placement of each function execution using predictive models of function execution time and network latencies. We present a prototype implementation of our framework that uses AWS Lambda and OpenFaaS for the public and private cloud, respectively. We evaluate our prototype in live experiments using a mixture of compute- and I/O-heavy serverless applications. Our results show that our framework can achieve a speedup in batch processing of up to 1.92 times that of an approach that uses only the private cloud, at 40.5% of the cost of an approach that uses only the public cloud. |
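A minimal greedy placement sketch under assumed runtimes, slot counts, and prices (not the framework's actual predictive models or scheduler): run each function on the private cloud when predicted completion still meets the deadline, otherwise spill it to the public cloud and pay for it.

```python
def schedule(jobs, deadline, private_slots, public_cost_per_s):
    # jobs: list of (name, predicted_runtime_s); scheduling longest-first helps packing.
    private_finish = [0.0] * private_slots
    placement, public_cost = {}, 0.0
    for name, runtime in sorted(jobs, key=lambda j: -j[1]):
        slot = min(range(private_slots), key=lambda s: private_finish[s])
        if private_finish[slot] + runtime <= deadline:
            private_finish[slot] += runtime
            placement[name] = f"private:slot{slot}"
        else:
            placement[name] = "public"
            public_cost += runtime * public_cost_per_s
    return placement, public_cost

jobs = [("resize", 30), ("ocr", 120), ("transcode", 300), ("index", 45)]
print(schedule(jobs, deadline=180, private_slots=2, public_cost_per_s=0.002))
```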
Autonomous (and partially autonomous) agents are beginning to play significant roles in safety-critical and privacy-critical domains, such as driving and healthcare. When humans operate in these spaces, not only are there regulations and laws dictating proper behavior, but crucially, neurobiologically normal humans can be expected to comprehend how to reason with certain principles to ensure that their actions are legally/ethically/prudentially correct (whether or not these humans choose to abide by the principles in question). It seems reasonable that we should hold autonomous agents to, minimally, the same standard we hold humans to. In this paper, we present a framework for autonomous aircraft piloting agents to reason about ethical problems in the context of emergency landings. In particular, we are concerned with ethical problems in which every option is equally unethical with regard to the ethical principles the options violate; and the only distinguishing factor is the likelihood that a plan will violate an ethical principle. We conclude by discussing why, in general, we find an inference-theoretic approach to ethical reasoning to be superior to the model-theoretic approach of prior work. |
Aircraft sensors measure physical quantities to help pilots and flight automation systems with situational awareness and decision making. Unfortunately, some important quantities of interest (QoI), e.g. aircraft weight, cannot be directly measured by sensors. As a consequence, accidents can happen, exemplified by Tuninter 1153 and Cessna 172R N4207P, where the airplanes were underweight (not enough fuel) and overweight (6% over maximum gross weight) respectively. Learning models to infer QoI from other aircraft sensor data is thus critical to safety through analytical redundancy. In this paper, we extend PILOTS, our declarative programming language for stream analytics, to learn models from data. We illustrate the supervised machine learning extensions to PILOTS with an example where we use take-off speed profiles under different density altitudes and runway conditions to estimate aircraft weight. Using data we collected from the XPlane flight simulator for a Cessna 172SP, we compare the results of several models on accuracy and timeliness. We also consider ensemble learning to improve the accuracy of weight estimation during takeoff from 94.3% (single model) to 97% (multiple models). Given that the average length of a take-off is 26.75s, this model was able to converge within 10% of the correct weight after 10.7s and converge within 5% after 17.7s. On August 25th, 2014, a Cessna 172R, N4207P, crashed killing the pilot and three passengers. The National Transportation Safety Board (NTSB) report calculated the aircraft to be 1.06 times the maximum gross weight. We simulated the take-off in X-Plane using information from the report. We were able to estimate within 5% error after 8s, which is less than 200ft down the runway, and at the point of take-off, 27s, had an error of 3%. This means that our model could have alerted the pilot of an overweight condition well before the aircraft became airborne, leaving more than 2000ft of runway to come to a stop. |
The ever-increasing complexity of cyber physical systems drives the need for assurance of critical infrastructure and embedded systems. Building assurance cases is a way to increase confidence in systems. In general, the construction of assurance cases is a manual process and the resulting artifacts are not machine analyzable. The High Assurance Systems team at GE Research is developing technology to support generation of formalized assurance cases for systems, which are both human-readable and machine-analyzable. We have developed a Semantic Application Design Language Assurance Toolkit (SADL-AT) including a semantic model to formalize the Goal Structuring Notation for assurance cases. This paper describes the toolkit SADL-AT and demonstrates the capabilities and effectiveness of SADL-AT by building security and safety assurance case fragments for an unmanned aerial vehicle-based example – a delivery drone. |
This work presents formal progress envelopes applied to flight systems for distinctly classifying a system's state space into regions where a formal proof of progress for a distributed algorithm holds or does not hold. It also presents an approach for runtime integration of formal methods in the dynamic data-driven applications systems (DDDAS) architecture using parameterized proofs. Finally, it showcases the development of reusable parameterized proof libraries for high-level statistical and stochastic reasoning in the Athena proof assistant and demonstrates their use with a progress proof for the Paxos distributed consensus protocol. |
In autonomous air-traffic management scenarios of the future, manned and unmanned aircraft will be able to safely navigate through the National Airspace System, independent of centralized air-traffic controllers, by sharing critical data necessary for maintaining standard separation with each other. Under such conditions, every aircraft must have sufficient knowledge about other aircraft sharing the airspace to operate safely. In this paper, we specify such a state of knowledge and present a formally verified distributed knowledge propagation protocol, which guarantees that this state will eventually be attained, leading to heightened collaborative situational awareness among the aircraft. We use the TLA+ Specification Language to specify our protocol and some safety-critical correctness properties. We also provide mechanically-verified proofs of the correctness properties, under a set of suitable operating conditions of the system, by using the TLA+ Proof System. |
In aviation, there are many values that are useful for pilots to know but that cannot be measured by sensors and must instead be estimated accurately using charts or other means. Computing aircraft take-off distance, for example, requires knowing wind speed, pressure altitude, and temperature. However, the inputs of these calculations can change during flight, or be calculated incorrectly by mistake, and it would be useful for pilots to know these values in real time. PILOTS is a programming language for spatio-temporal data stream processing. We have added improved integration for machine learning algorithms as well as a linguistic abstraction for training these models. In data-driven systems, it can be useful to use distributed processes for computation. We have designed a declarative framework for federated learning and the aggregation of results from multiple related models within PILOTS. Furthermore, we built a model using PILOTS that is able to estimate weight in real time during take-off of a fixed-wing aircraft using data available from the avionics. We evaluated the results of several models on accuracy and timeliness. Data was collected from the flight simulator X-Plane. Accidents such as the fatal crash of Cessna 172R N4207P could have been prevented using the weight estimation methods illustrated. |
2019 |
Desktop cloud is an opportunistic platform that provides cloud computing services on desktop computers, typically located on a university or business campus. Desktop cloud systems take advantage of the idle resources of computers when their users perform routine activities, or when the computers are fully available. A desktop cloud manages these resources to run virtual machines, with their operating systems and applications, without affecting the performance perceived by the users of the computers. Virtual machines allow users of a desktop cloud, typically researchers, to execute their academic or scientific applications at the same time as the processes of the users of the desktop computers. Since the infrastructure of a desktop cloud is based on non-dedicated computers, these systems are more susceptible to failures than traditional cloud computing providers. These platforms must face the interruptions and interference caused by the users of desktop computers and their applications. For example, the users of the computers can turn them off or restart them, disconnect them from the network, or execute computationally demanding applications that affect the normal execution of virtual machines. Desktop clouds have been used successfully in the execution of bag-of-tasks applications, where a problem is solved by dividing it into independent tasks that run in parallel. In this type of application, since the tasks are independent, any of them can be executed on another physical machine if a node of the platform fails. Other applications, with processes that communicate with each other, for example applications based on the Message Passing Interface (MPI), are more fragile in the face of failures, since a failure in one node can affect the entire system. There are several implementations of desktop clouds or similar systems, such as CernVM, cuCloud, ad hoc cloud computing, GBAC, and UnaCloud, our case study. These systems take advantage of the idle resources of the participating computers for the execution of virtual machines. Due to this opportunistic access to resources, desktop cloud systems generally offer a best-effort delivery service, in which there are no guarantees about the successful execution of the applications that run in the virtual machines. However, these platforms can move towards reliable service delivery despite the volatility of the computational resources on which they are supported. In this way, new applications can be executed and some service guarantees can be offered to their users. In this doctoral thesis, we have performed and refined a fault analysis, considering different desktop cloud and similar systems, with emphasis on UnaCloud. As a result of this analysis, we found that desktop cloud systems fail mainly at two points: during the provisioning and during the execution of virtual machines. We have proposed an extended chain of threats and used it to identify the main faults that occur in these two phases and to determine their causes (anomalies, interruptions, and errors) and their consequences. In addition, we suggest some mitigation strategies to counteract anomalies and interruptions. We have compiled an initial set of strategies that allow us to mitigate the effects of the identified failures, developed a functional prototype that responds to those failures using the strategies with the greatest impact in our context, and evaluated its behavior. 
Thanks to this analysis, we found that the provisioning of virtual machines has significant limitations in scalability. Virtual images are large files whose transmission through the network is slow and prone to failures. In addition, reliability problems begin when the number of virtual machines to be run on the system starts to grow, particularly because of the space they occupy on disk, the use of the network, and the time it takes to provision them. On the other hand, the actions carried out by the users of the computers, or the applications that they launch, can interrupt the execution of the virtual machines running in the desktop cloud and ruin the work done up to that moment. We have proposed a solution to mitigate the effects of failures in both the provisioning and the execution of virtual machines. First, we implemented a new virtual machine provisioning model for a desktop cloud based on a catalog of preconfigured virtual images, stored on multitasking writing disks and pre-loaded on the computers where the virtual machines run. Second, we have developed a global snapshot solution to store the state of a distributed system that runs in the virtual machines of a desktop cloud. Our implementation allows us to obtain multiple global snapshots and provides a mechanism to resume execution consistently from any of them, without missing or duplicated messages. We have validated the software developed through functional and performance tests and verified that the proposed solutions can be used to improve the reliability of a desktop cloud system. |
We have developed a method for estimating the properties of the progenitor dwarf galaxy from the tidal stream of stars that were ripped from it as it fell into the Milky Way. In particular, we show that the mass and radial profile of a progenitor dwarf galaxy evolved along the orbit of the Orphan Stream, including the stellar and dark matter components, can be reconstructed from the distribution of stars in the tidal stream it produced. We use MilkyWay@home, a PetaFLOPS-scale distributed supercomputer, to optimize our dwarf galaxy parameters until we arrive at best-fit parameters. The algorithm fits the dark matter mass, dark matter radius, stellar mass, radial profile of stars, and orbital time. The parameters are recovered even though the dark matter component extends well past the half light radius of the dwarf galaxy progenitor, proving that we are able to extract information about the dark matter halos of dwarf galaxies from the tidal debris. Our simulations assumed that the Milky Way potential, dwarf galaxy orbit, and the form of the density model for the dwarf galaxy were known exactly; more work is required to evaluate the sources of systematic error in fitting real data. This method can be used to estimate the dark matter content in dwarf galaxies without the assumption of virial equilibrium that is required to estimate the mass using line-of-sight velocities. This demonstration is a first step towards building an infrastructure that will fit the Milky Way potential using multiple tidal streams. |
Recent trends in artificial intelligence and machine learning (AI/ML), dynamic data driven application systems (DDDAS), and cloud computing provide opportunities for enhancing multi-domain systems performance. The DDDAS framework utilizes models, measurements, and computation to enhance real-time sensing, performance, and analysis. One example that represents a multi-domain scenario is "fly-by-feel" avionics systems that can support autonomous operations. A "fly-by-feel" system measures the aerodynamic forces (wind, pressure, temperature) for physics-based adaptive flight control to increase maneuverability, safety, and fuel efficiency. This paper presents a multi-domain approach that identifies safe flight operation platform position needs from which models, data, and information are invoked for effective multi-domain control. Concepts are presented to demonstrate the DDDAS approach for enhanced multi-domain coordination bringing together modeling (data at rest), control (data in motion), and command (data in use). |
Dynamic data-driven application systems (DDDAS) allow for unprecedented self-healing and self-diagnostic behavior across a broad swathe of domains. The usefulness of these systems is offset against their inherent complexity, and therefore fragility to specification or implementation error. Further, DDDAS techniques are often applied in safety-critical domains, where correctness is paramount. Formal methods facilitate the development of correctness proofs about software systems, which provide stronger behavioral guarantees than non-exhaustive unit tests. While unit testing can validate that a system behaves correctly in some finite number of configurations, formal methods enable us to prove correctness in an infinite subset of the configuration space, which is often needed in cyber-physical systems involving continuous mechanics. Although the efficacy of formal methods is traditionally offset by significantly greater development cost, we propose new development techniques that can mitigate this concern. In this paper, we explore novel techniques for assuring the correctness of data-driven systems based on certified programming and software verification. In particular, we focus on the use of interactive theorem-proving systems to prove foundational properties about data-driven systems, possibly reliant upon physics-based assumptions and models. We introduce the concept of the formal safety envelope, analogous to the concept of an aircraft’s performance envelope, which organizes system properties in a way that makes it clear which properties hold under which assumptions. Beyond maintaining modularity in proof development, this technique furthermore enables the derivation of runtime monitors to detect potentially unsafe system state changes, allowing the user to know precisely which properties have been verified to hold for the current system state. Using this method, we demonstrate the partial verification of an archetypal data-driven system from avionics, where wing sensor data is used to determine whether or not an airplane is likely to be in a stall state. |
There is widespread and increasing interest in big data analytics (BDA) solutions to enable data collection, transformation, and predictive analyses. The development and operation of BDA applications involve business innovation, advanced analytics, and cutting-edge technologies, which add new complexities to traditional software development. Although there is a growing interest in BDA adoption, successful deployments are still scarce (a.k.a. the "Deployment Gap" phenomenon). This paper reports an empirical study on BDA deployment practices, techniques, and tools in the industry from both the software architecture and data science perspectives to understand research challenges that emerge in this context. Our results suggest new research directions to be tackled by the software architecture community. In particular, competing architectural drivers, interoperability, and deployment procedures in the BDA field are still immature or have not been adopted in practice. |
Big data analytics (BDA) applications use advanced analysis algorithms to extract valuable insights from large, fast, and heterogeneous data sources. These complex BDA applications require software design, development, and deployment strategies to deal with volume, velocity, and variety (3vs) while sustaining expected performance levels. BDA software complexity frequently leads to delayed deployments, longer development cycles and challenging performance monitoring. This paper proposes a DevOps and Domain Specific Model (DSM) approach to design, deploy, and monitor performance Quality Scenarios (QS) in BDA applications. This approach uses high-level abstractions to describe deployment strategies and QS enabling performance monitoring. Our experimentation compares the effort of development, deployment and QS monitoring of BDA applications with two use cases of near mid-air collisions (NMAC) detection. The use cases include different performance QS, processing models, and deployment strategies. Our results show shorter (re)deployment cycles and the fulfillment of latency and deadline QS for micro-batch and batch processing. |
We have created a highly declarative programming language called PILOTS that enables error detection and estimation of correct data streams based on analytical redundancy (i.e., algebraic relationship between data streams). Data scientists are able to express their analytical redundancy models with the domain specific grammar of PILOTS and test their models with erroneous data streams. PILOTS has the ability to express a single analytical redundancy, and it has been successfully applied to data from aircraft accidents such as Air France flight 447 and Tuninter flight 1153 where only one simultaneous sensor type failure was observed. In this work, we extend PILOTS to support multiple models of analytical redundancy and improve situational awareness for multiple simultaneous sensor type failures. Motivated by the two recent accidents involving the Boeing 737 Max 8, which was potentially caused by a faulty angle of attack sensor, we focus on recovering angle of attack data streams under multiple sensor type failure scenarios. The simulation results show that multiple models of analytical redundancy enable us to detect failure modes that are not detectable with a single model. |
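One analytical-redundancy relation of the kind such a model could encode is the level-flight lift equation, sketched below; the aircraft coefficients and tolerance are illustrative assumptions, and the real PILOTS models are expressed in its own declarative grammar rather than Python.

```python
import math

def aoa_from_lift(weight_n, airspeed_ms, rho=1.225, S=16.2, cl_alpha=5.0, alpha0_deg=-1.5):
    # Level flight: lift = weight, so C_L = 2W / (rho V^2 S) and alpha ~ alpha0 + C_L / C_L_alpha.
    cl = 2.0 * weight_n / (rho * airspeed_ms ** 2 * S)
    return alpha0_deg + math.degrees(cl / cl_alpha)

def check_aoa(measured_deg, weight_n, airspeed_ms, tol_deg=3.0):
    # Compare the sensed angle of attack against the redundant estimate; if the
    # residual is too large, flag the sensor and fall back to the estimate.
    estimate = aoa_from_lift(weight_n, airspeed_ms)
    if abs(measured_deg - estimate) > tol_deg:
        return ("sensor_suspect", estimate)
    return ("ok", measured_deg)

# Level flight example: ~1000 kg aircraft at 50 m/s; a 15 deg reading is implausible.
print(check_aoa(measured_deg=15.0, weight_n=1000 * 9.81, airspeed_ms=50.0))
print(check_aoa(measured_deg=3.5,  weight_n=1000 * 9.81, airspeed_ms=50.0))
```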
We describe an algorithm that can fit the properties of the dwarf galaxy progenitor of a tidal stream, given the properties of that stream. We show that under ideal conditions (the Milky Way potential, the orbit of the dwarf galaxy progenitor, and the functional form of the dwarf galaxy progenitor are known exactly), the density and angular width of stars along the stream can be used to constrain the mass and radial profile of both the stellar and dark matter components of the progenitor dwarf galaxy that was ripped apart to create the stream. Our provisional fit for the parameters of the dwarf galaxy progenitor of the Orphan Stream indicates that it is less massive and has fewer stars than previous works have indicated. |
We present a novel conflict-aware flight planning approach that avoids the possibility of near mid-air collisions (NMACs) in the flight planning stage. Our algorithm computes a valid flight-plan for an aircraft (ownship) based on a starting time, a set of discrete way-points in 3D space, discrete values of ground speed, and a set of available flight-plans for traffic aircraft. A valid solution is one that avoids loss of standard separation with available traffic flight-plans. Solutions are restricted to permutations of constant ground speed and constant vertical speed for the ownship between consecutive waypoints. Since the course between two consecutive way-points is not changed, this strategy can be used in situations where vertical or lateral constraints due to terrain or weather may restrict deviations from the original flight-plan. This makes our approach particularly suitable for unmanned aerial systems (UAS) integration into urban air traffic management airspace. Our approach has been formally verified using the Athena proof assistant. Our work, therefore, complements the state-of-the-art pairwise tactical conflict resolution approaches by enabling an ownship to generate strategic flight-plans that ensure standard separation with multiple traffic aircraft, while conforming to possible restrictions on deviation from its flight path. |
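A minimal separation-check sketch over discretized time is shown below (illustrative only, not the Athena-verified algorithm; the waypoint format, sampling step, and separation minima of 5 NM / 1000 ft are stated assumptions).

```python
import numpy as np

H_SEP_NM, V_SEP_FT = 5.0, 1000.0

def position(plan, t):
    # plan: list of (t_s, x_nm, y_nm, alt_ft) waypoints with piecewise-linear motion.
    plan = np.asarray(plan, float)
    return np.array([np.interp(t, plan[:, 0], plan[:, i]) for i in (1, 2, 3)])

def separated(own_plan, traffic_plan, t0, t1, dt=10.0):
    # Reject any candidate plan that loses standard separation with the traffic plan.
    for t in np.arange(t0, t1 + dt, dt):
        dx, dy, dz = position(own_plan, t) - position(traffic_plan, t)
        if np.hypot(dx, dy) < H_SEP_NM and abs(dz) < V_SEP_FT:
            return False
    return True

own     = [(0, 0, 0, 5000), (600, 20, 0, 7000)]
traffic = [(0, 20, 0, 5000), (600, 0, 0, 5000)]
print(separated(own, traffic, 0, 600))   # False: the paths cross at similar altitudes
```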
2018 |
Infrastructure-as-a-Service (IaaS) clouds such as Amazon EC2 offer various types of virtual machines (VMs) through pay-per-use pricing. Elastic resource allocation allows us to allocate and release VMs as computing demand changes while satisfying Quality-of-Service (QoS) requirements. In this thesis, we explore QoS-aware elastic resource allocation for three different data processing models: batch, micro-batch, and streaming. First, we present two frameworks for elastic batch data processing. The first elastic batch data processing framework supports autonomous VM scaling using application-level migration. It does not require any prior knowledge about the target application, but dynamically reconfigures the application to keep the CPU utilization within a certain range. The second framework uses Workload-tailored Elastic Compute Units as a measure of computing resources analogous to Amazon EC2’s ECUs. Given a deadline, our framework finds the cost-optimal resource configuration of heterogeneous VMs to satisfy the required throughput. Next, we propose an elastic micro-batch data processing framework for continuous air traffic optimization. Air traffic optimization is commonly formulated as an integer linear programming (ILP) problem. For continuous optimization, we periodically solve ILP problems with regular intervals, where each problem is a micro-batch data processing job. Since the fluctuating number of flights creates dynamically changing computational demand, our framework predicts future workload and proactively schedules VMs to solve the ILP problems in a timely manner. Finally, we propose a framework for sustainable elastic stream processing based on the concept of Maximum Sustainable Throughput (MST). It is the maximum processing throughput a streaming application can process indefinitely for a number of VMs. Stream processing is sustainable if the system’s MST is always greater than the input data rates of incoming workload. Using MST and future workload prediction models, our framework proactively schedules VMs to keep the stream processing sustainable. It explicitly incorporates uncertainties in both MST and workload prediction models, and estimates the number of VMs to satisfy a certain probability criteria. Our studies show that QoS-aware elastic data processing is effective for these processing models in both performance scalability and cost savings. For batch processing, elastic resource scheduling helps achieve the target QoS metrics such as CPU utilization and job completion time. For both micro-batch and stream processing with fluctuating workloads, QoS-aware elastic scheduling saves up to 49% cost compared to a static scheduling that covers the peak workload to achieve a similar level of throughput QoS satisfaction. These results show potential for future fully automated cloud computing resource management systems that efficiently enable truly elastic and scalable general-purpose workload. |
The Cloud computing paradigm has revolutionised the computer science horizon during the past decade and has enabled the emergence of computing as the fifth utility. It has captured significant attention of academia, industries, and government bodies. Now, it has emerged as the backbone of modern economy by offering subscription-based services anytime, anywhere following a pay-as-you-go model. This has instigated (1) shorter establishment times for start-ups, (2) creation of scalable global enterprise applications, (3) better cost-to-value associativity for scientific and high-performance computing applications, and (4) different invocation/execution models for pervasive and ubiquitous applications. The recent technological developments and paradigms such as serverless computing, software-defined networking, Internet of Things, and processing at network edge are creating new opportunities for Cloud computing. However, they are also posing several new challenges and creating the need for new approaches and research strategies, as well as the re-evaluation of the models that were developed to address issues such as scalability, elasticity, reliability, security, sustainability, and application models. The proposed manifesto addresses them by identifying the major open challenges in Cloud computing, emerging trends, and impact areas. It then offers research directions for the next decade, thus helping in the realisation of Future Generation Cloud Computing. |
Stream processing systems deployed on the cloud need to be elastic to effectively accommodate workload variations over time. Performance models can predict maximum sustainable throughput (MST) as a function of the number of VMs allocated. We present a scheduling framework that incorporates three statistical techniques to improve Quality of Service (QoS) of cloud stream processing systems: (i) uncertainty quantification to consider variance in the MST model; (ii) online learning to update the MST model as new performance metrics are gathered; and (iii) workload models to predict input data stream rates, assuming regular patterns occur over time. Our framework can be parameterized by a QoS satisfaction target that statistically finds the best performance/cost tradeoff. Our results illustrate that each of the three techniques alone significantly improves QoS, raising average QoS satisfaction rates from 52% to 73-81% for eight benchmark applications. Furthermore, applying all three techniques allows us to reach a 98.62% QoS satisfaction rate with a cost less than twice the cost of the optimal (in hindsight) VM allocations, and half the cost of allocating VMs for the peak demand in the workload. |
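The following sketch illustrates how a QoS satisfaction target can be combined with an uncertain MST model: it picks the smallest VM count whose predicted MST exceeds the forecast input rate with the requested probability. The Gaussian error assumption, the toy linear MST model, and all names are illustrative, not the paper's implementation.

```python
# Illustrative sketch (assumed model form): choose the smallest number of VMs
# whose predicted MST exceeds the forecast input rate with at least the
# requested QoS satisfaction probability, treating MST error as Gaussian.
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def vms_needed(predicted_rate, mst_mean, mst_std, target=0.95, max_vms=64):
    """mst_mean(n)/mst_std(n) give the MST model's mean and uncertainty for n VMs."""
    for n in range(1, max_vms + 1):
        p_sustainable = 1.0 - normal_cdf(predicted_rate, mst_mean(n), mst_std(n))
        if p_sustainable >= target:
            return n
    return max_vms

# Example with a toy linear MST model: ~10k tuples/s per VM, 8% relative error.
print(vms_needed(55_000, mst_mean=lambda n: 10_000 * n,
                 mst_std=lambda n: 0.08 * 10_000 * n))   # prints 7
```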
We investigate MapReduce-based data aggregation for Internet-of-Things data in a multi-tier, geo-distributed datacenter architecture. Specifically, we consider 1) end-to-end hierarchical data aggregation and 2) query response for aggregated data requests made by geo-distributed clients. We first develop a realistic performance model based on previous empirical studies. We then study application performance for various deployment architectures, ranging from a purely cloud-based approach to a geo-distributed architecture that combines cloud, fog, and edge resources. From simulations based on U.S. Census data, we characterize the trade-off between end-to-end data aggregation time and query response time. Our experiments show that for data aggregation, a purely cloud-based deployment is 53% faster than a deployment with edge resources; however, for query response, the edge approach is 46% faster due to the edge resources' proximity to query clients. |
Loss of thrust (LOT) emergencies create the need for quickly providing pilots with valid trajectories for safely landing the aircraft. It is easy to pre-compute total loss of thrust trajectories for every possible initial point in a 3D flight plan, but it is impossible to predict variables like the availability of partial power, wing surface damage, and wind aloft in advance. Availability of partial power can affect the glide ratio of an aircraft, while the presence of wind can significantly affect the trajectory of a gliding aircraft with respect to the ground, e.g., a tailwind or a headwind can aid or hinder a straight-line glide by increasing or decreasing the ground speed. Wind can also change the shape of turns from circular to trochoidal, moving an aircraft away from its intended position. In this paper, we present a robust trajectory generation system that can take these dynamic factors into consideration. Our approach outputs valid trajectories to a target runway in the presence of a constant, horizontal wind, by using purely geometric criteria for computing flyable trajectories. We model the effect of wind on different components of a possible trajectory by taking into account the observed glide ratio of the aircraft (computed from actual flight performance data) and the horizontal wind vector. We also take into account the effect of wind on ground speed, the effective glide ratio with respect to the ground, and the shape of turns to calculate trajectories to a virtual point in 3D space which can lead an aircraft to an actual target runway. We introduce an analytical approach for calculating the virtual point for trajectories with left-straight-left or right-straight-right Dubins path segments and a heuristic iterative approach for other cases. Our approach generates trajectories that can lead an aircraft from an initial configuration (latitude, longitude, altitude, heading) to a target configuration in the presence of a constant horizontal wind. In our experiments, the computation time for trajectories ranged from 40 milliseconds to 60 milliseconds. |
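A simplified numeric illustration of the head/tailwind effect on a straight-line glide mentioned above, assuming a constant horizontal wind along the track and a constant best-glide airspeed; the values are illustrative, not flight-test data.

```python
# Purely geometric sketch: sink rate is tied to airspeed, so the ground
# distance covered per unit of altitude lost scales with ground speed.
def effective_glide_ratio(glide_ratio_air, airspeed, wind_along_track):
    """wind_along_track > 0 for tailwind, < 0 for headwind (same units as airspeed)."""
    ground_speed = airspeed + wind_along_track
    return glide_ratio_air * ground_speed / airspeed

print(effective_glide_ratio(17.25, 120, +20))   # tailwind stretches the glide: ~20.1
print(effective_glide_ratio(17.25, 120, -20))   # headwind shortens it:         ~14.4
```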
Loss of thrust emergencies, e.g., induced by bird strikes or fuel exhaustion, give rise to the need for expeditiously generating feasible trajectories to nearby runways, in order to guide pilots. It is possible to pre-compute total loss of thrust trajectories from every point in a 3D flight plan, but dynamic factors which affect the feasibility of a trajectory, like partial power, wind conditions, and aircraft surface damage, cannot be predicted beforehand. We present a dynamic data-driven avionics software approach for emergency aircraft trajectory generation which can account for these factors. Our approach updates a damaged aircraft performance model during flight which is used for generating valid trajectories to a safe landing site. This model is parameterized on a baseline glide ratio (g0) for a clean aircraft configuration, assuming best gliding airspeed on straight flight. The model predicts purely geometric criteria for flight trajectory generation, namely, glide ratio and radius of turn for different bank angles and drag configurations. Our model can dynamically infer the most accurate baseline glide ratio of an aircraft from real-time aircraft sensor data. We further introduce a trajectory utility function to rank trajectories for safety, in particular, to prevent steep turns close to the ground and to remain as close to the airport or landing zone as possible. Wind can significantly affect a feasible gliding trajectory with respect to the ground by changing the shape of turns from circular to trochoidal, and by increasing or decreasing the effective ground speed. Thus, in the presence of wind, otherwise feasible trajectories may become infeasible. Therefore, we present an additional wind model that takes into account the observed baseline glide ratio of an aircraft and the horizontal wind vector (w). Our dynamic data-driven system uses this wind model to generate wind-aware trajectories that are feasible in the presence of a steady, horizontal wind. As a use case, we consider the Hudson River ditching of US Airways 1549 in January 2009, using a flight simulator to evaluate our trajectories and to get sensor data (airspeed, GPS location, and barometric altitude). In this example, baseline glide ratios of 17.25:1 and 19:1 enabled us to generate trajectories up to 28 seconds and 36 seconds after the bird strike, respectively. We were also able to generate a feasible wind-assisted trajectory when trajectories were not possible in the absence of wind. In our experiments, the computation time for a single trajectory ranged from 40 milliseconds to 60 milliseconds. |
2017 |
Ubiquitous sensing is pervasive in society for such applications as biometrics for health care, smart grids for power delivery, and avionics for transportation safety. As society continues to rely ever more on sensors for various applications, there is a need to address the accuracy of sensor readings for health maintenance, signal identification, and control. While there have been advances in information fusion for avionics control and user warnings, there is still a need for further research in methods that allow for fault detection and recovery techniques to be easily realized and implemented with minimal risk of software errors. |
In sensor-based systems, spatio-temporal data streams are often related in non-trivial ways. For example, in avionics, while the airspeed that an aircraft attains in cruise phase depends on the weight it carries, it also depends on many other factors such as engine inputs, angle of attack, and air density. It is therefore a challenge to develop failure models that can help recognize errors in the data, such as an incorrect fuel quantity or an incorrect airspeed. In this paper, we present a highly-declarative programming framework that facilitates the development of self-healing avionics applications, which can detect and recover from data errors. Our programming framework enables specifying expert-created failure models using error signatures, as well as learning failure models from data. To account for unanticipated failure modes, we propose a new dynamic Bayes classifier that detects outliers and upgrades them to new modes when they are statistically significant. We evaluate error signatures and our dynamic Bayes classifier for accuracy, response time, and adaptability of error detection. While error signatures can be more accurate and responsive than dynamic Bayesian learning, the latter method adapts better due to its data-driven nature. |
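A condensed sketch of the idea behind such a dynamic Bayes classifier, with Gaussian modes, an outlier buffer, and creation of a new mode once enough statistically unlikely samples accumulate; the structure and thresholds are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: classify a scalar observation against known Gaussian modes;
# observations unlikely under every mode are buffered and, once numerous
# enough, promoted to a new mode. Thresholds are illustrative assumptions.
import math

class DynamicBayes:
    def __init__(self, modes, outlier_ll=-10.0, min_new_mode=30):
        self.modes = dict(modes)          # name -> (mean, std)
        self.outlier_ll = outlier_ll      # log-likelihood cutoff for outliers
        self.min_new_mode = min_new_mode  # samples needed to accept a new mode
        self.buffer = []

    @staticmethod
    def loglik(x, mean, std):
        return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

    def classify(self, x):
        name, ll = max(((m, self.loglik(x, *p)) for m, p in self.modes.items()),
                       key=lambda t: t[1])
        if ll >= self.outlier_ll:
            return name                   # best existing mode explains x
        self.buffer.append(x)             # unlikely under all known modes
        if len(self.buffer) >= self.min_new_mode:
            mean = sum(self.buffer) / len(self.buffer)
            var = sum((v - mean) ** 2 for v in self.buffer) / len(self.buffer)
            self.modes[f"mode_{len(self.modes)}"] = (mean, max(math.sqrt(var), 1e-6))
            self.buffer.clear()
        return "unknown"
```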
Recently, a new concept called desktop cloud emerged, which was developed to offer cloud computing services on non-dedicated resources. Like cloud computing, desktop clouds are based on virtualization, and like other computational systems, they may experience faults at any time. As a consequence, reliability has become a concern for researchers. Fault-tolerance strategies focused on independent virtual machines include snapshots (checkpoints) to resume the execution from a healthy state of a virtual machine on the same or another host, which is trivial because hypervisors provide this function. However, it is not trivial to obtain a global snapshot of a distributed system formed by applications that communicate among themselves, because there is no global clock, so it cannot be guaranteed that the snapshots of each VM will be taken at the same time. Therefore, a protocol is needed to coordinate the participants to obtain a global snapshot. In this paper, we propose a global snapshot protocol called UnaCloud Snapshot for its application in the context of desktop clouds over TCP/IP networks. It differs from other proposals, which use a virtual network to inspect and manipulate the traffic circulating among virtual machines, making them difficult to apply to more realistic environments. We obtain a consistent global snapshot for a general distributed system running on virtual machines that maintains the semantics of the system without modifying the applications running on the virtual machines or the hypervisors. A first prototype was developed and the preliminary results of our evaluation are presented. |
In cloud-based stream processing services, the maximum sustainable throughput (MST) is defined as the maximum throughput that a system composed of a fixed number of virtual machines (VMs) can ingest indefinitely. If the incoming data rate exceeds the system's MST, unprocessed data accumulates, eventually making the system inoperable. Thus, it is important for the service provider to keep the MST always larger than the incoming data rate by dynamically changing the number of VMs used by the system. In this paper, we identify a common data processing environment used by modern data stream processing systems, and we propose MST prediction models for this environment. We train the models using linear regression with samples obtained from a few VMs and predict MST for a larger number of VMs. To minimize the time and cost for model training, we statistically determine a set of training samples using Intel's Storm benchmarks with representative resource usage patterns. Using typical use-case benchmarks on Amazon's EC2 public cloud, our experiments show that, training with up to 8 VMs, we can predict MST for streaming applications with less than 4% average prediction error for 12 VMs, 9% for 16 VMs, and 32% for 24 VMs. Further, we evaluate our prediction models with simulation-based elastic VM scheduling on a realistic workload. These simulation results show that with 10% over-provisioning, our proposed models' cost efficiency is on par with the cost of an optimal scaling policy without incurring any service level agreement violations. |
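The extrapolation idea can be sketched in a few lines: fit MST as a linear function of the VM count from measurements on a few small configurations, then predict MST for larger ones. The numbers and the strictly linear form below are illustrative.

```python
# Fit MST vs. VM count on small configurations, extrapolate to larger ones.
import numpy as np

vms_measured = np.array([2, 4, 6, 8])              # small training configurations
mst_measured = np.array([21.0, 40.5, 59.0, 80.2])  # e.g., thousand tuples/s (made up)

slope, intercept = np.polyfit(vms_measured, mst_measured, 1)

def predict_mst(n_vms):
    return slope * n_vms + intercept

for n in (12, 16, 24):
    print(n, round(predict_mst(n), 1))
```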
Loss of thrust emergencies, e.g., induced by bird/drone strikes or fuel exhaustion, create the need for dynamic data-driven flight trajectory planning to advise pilots or control UAVs. While total loss of thrust trajectories to nearby airports can be pre-computed for all initial points in a 3D flight plan, dynamic aspects such as partial power and airplane surface damage must be considered for accuracy. In this paper, we propose a new Dynamic Data-Driven Avionics Software (DDDAS) approach which during flight updates a damaged aircraft performance model, used in turn to generate plausible flight trajectories to a safe landing site. Our damaged aircraft model is parameterized on a baseline glide ratio for a clean aircraft configuration assuming best gliding airspeed on straight flight. The model predicts purely geometric criteria for flight trajectory generation, namely, glide ratio and turn radius for different bank angles and drag configurations. Given actual aircraft performance data, we dynamically infer the baseline glide ratio to update the damaged aircraft model. Our new flight trajectory generation algorithm thus can significantly improve upon prior Dubins-based trajectory generation work by considering these data-driven geometric criteria. We further introduce a trajectory utility function to rank trajectories for safety. As a use case, we consider the Hudson River ditching of US Airways 1549 in January 2009 using a flight simulator to evaluate our trajectories and to get sensor data. In this case, a baseline glide ratio of 17.25:1 enabled us to generate trajectories up to 28 seconds after the bird strike, whereas a 19:1 baseline glide ratio enabled us to generate trajectories up to 36 seconds after the bird strike. DDDAS can significantly improve the accuracy of generated flight trajectories, thereby enabling better decision support systems for pilots in emergency conditions. |
The Cloud computing paradigm has revolutionised the computer science horizon during the past decade and has enabled the emergence of computing as the fifth utility. It has captured significant attention of academia, industries, and government bodies. Now, it has emerged as the backbone of modern economy by offering subscription-based services anytime, anywhere following a pay-as-you-go model. This has instigated (1) shorter establishment times for start-ups, (2) creation of scalable global enterprise applications, (3) better cost-to-value associativity for scientific and high performance computing applications, and (4) different invocation/execution models for pervasive and ubiquitous applications. The recent technological developments and paradigms such as serverless computing, software-defined networking, Internet of Things, and processing at network edge are creating new opportunities for Cloud computing. However, they are also posing several new challenges and creating the need for new approaches and research strategies, as well as the re-evaluation of the models that were developed to address issues such as scalability, elasticity, reliability, security, sustainability, and application models. The proposed manifesto addresses them by identifying the major open challenges in Cloud computing, emerging trends, and impact areas. It then offers research directions for the next decade, thus helping in the realisation of Future Generation Cloud Computing. |
2016 |
Developing standalone applications running on a single computer is very different from developing scalable applications running on the cloud, such as data analytics applications that process terabytes of data, Web applications that receive thousands of requests per second, or distributed computing applications where components run simultaneously across many computers. Cloud computing service providers help facilitate the development of these complex applications through their cloud programming frameworks. A cloud programming framework is a software platform to develop applications in the cloud that takes care of nonfunctional concerns, such as scalability, elasticity, fault tolerance, and load balancing. Using cloud programming frameworks, application developers can focus on the functional aspects of their applications and benefit from the power of cloud computing. In this chapter, we will show how to use some of the existing cloud programming frameworks in three application domains: data analytics, Web applications, and distributed computing. More specifically, we will explain how to use MapReduce (Dean and Ghemawat, 2008) for data analytics, Google App Engine (Google, 2014) for Web applications, and SALSA (Varela and Agha, 2001) for distributed computing. The rest of the chapter is structured as follows. In section 50.2, we describe nonfunctional concerns supported at different levels of cloud services and go through existing cloud programming frameworks. In section 50.3, we explain MapReduce, Google App Engine, and Simple Actor Language System and Architecture (SALSA). In section 50.4, we illustrate how to use these three programming frameworks by showing example applications. Finally, we conclude the chapter in section 50.5. |
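As a toy, single-process illustration of the MapReduce programming model discussed in the chapter (map, shuffle/group, reduce), consider the classic word count; real frameworks distribute these phases across many machines.

```python
# Toy word count following the MapReduce structure: map -> shuffle/group -> reduce.
from collections import defaultdict

def map_phase(doc_id, text):
    # doc_id is unused here; it is kept to mirror MapReduce's (key, value) input.
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    return (word, sum(counts))

documents = {1: "the cloud scales", 2: "the cloud elastically scales"}

groups = defaultdict(list)                     # shuffle: group emitted values by key
for doc_id, text in documents.items():
    for key, value in map_phase(doc_id, text):
        groups[key].append(value)

print(sorted(reduce_phase(w, c) for w, c in groups.items()))
```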
Spatio-temporal data streams are often related in complex ways; for example, while the airspeed that an aircraft attains in cruise phase depends on the weight it carries, it also depends on many other factors. Some of these factors are controllable, such as engine inputs or the airframe’s angle of attack, while others are contextual, such as air density or turbulence. It is therefore critical to develop failure models that can help recognize errors in the data, such as an incorrect fuel quantity, a malfunctioning pitot-static system, or other abnormal flight conditions. In this paper, we extend our PILOTS programming language to support machine learning techniques that will help data scientists: (1) create parameterized failure models from data and (2) continuously train a statistical model as new evidence (data) arrives. The linear regression approach learns the parameters of a linear model to minimize the least squares error for given training data. The Bayesian approach classifies operating modes according to supervised offline training and can discover new statistically significant modes online. As shown in the Tuninter 1153 simulation results, the dynamic Bayes classifier finds discrete error states on the fly, while the error signatures approach requires every error state to be predefined. Using synthetic data, we compare the accuracy, response time, and adaptability of these machine learning techniques. Future dynamic data-driven applications systems (DDDAS) using machine learning can identify complex dynamic data-driven failure models, which will in turn enable more accurate flight planning and control for emergency conditions. |
Cloud computing adds great on-demand scalability to stream processing systems with its pay-per-use cost model. However, promising service level agreements to users while keeping resource allocation costs low is a challenging task due to uncertainties coming from various sources, such as the target application's scalability, future computational demand, and the target cloud infrastructure's performance variability. To deal with these uncertainties, it is essential to create accurate application performance prediction models. In cloud computing, the current state of the art in performance modeling remains application-specific. We propose an application-agnostic performance modeling approach that is applicable to a wide range of applications. We also propose an extension to probabilistic performance prediction. This paper reports the progress we have made so far. |
As we are facing ever-increasing air traffic demand, it is critical to enhance air traffic capacity and alleviate human controllers' workload by viewing air traffic optimization as a continuous/online streaming problem. Air traffic optimization is commonly formulated as an integer linear programming (ILP) problem. Since ILP is NP-hard, these problems are computationally intractable in general. Moreover, a fluctuating number of flights changes the computational demand dynamically. In this paper, we present an elastic middleware framework that is specifically designed to solve ILP problems generated from continuous air traffic streams. Experiments show that our VM scheduling algorithm with time-series prediction can achieve similar performance to a static schedule while using 49% fewer VM hours for a realistic air traffic pattern. |
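A minimal sketch of the proactive scheduling idea: predict the next interval's flight count from recent history and start enough VMs ahead of time to solve the ILP within the interval. The moving-average predictor, per-VM capacity, and headroom factor are illustrative assumptions, not the framework's actual policy.

```python
# Predict next-interval workload from recent history and size the VM pool for it.
import math

def predict_next(history, window=3):
    recent = history[-window:]
    return sum(recent) / len(recent)        # simple moving-average forecast

def vms_to_schedule(history, flights_per_vm=50, headroom=1.2):
    predicted_flights = predict_next(history)
    return max(1, math.ceil(headroom * predicted_flights / flights_per_vm))

flights_per_interval = [180, 210, 260, 300, 340]   # fluctuating demand (made up)
print(vms_to_schedule(flights_per_interval))        # VMs to start for the next interval
```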
Dynamic data-driven applications and systems (DDDAS) use models to control systems using a feedback loop, however, they go beyond traditional control-theoretic approaches in that the data acquisition process itself may be controlled by steering the system resources. Thus, there is the opportunity to strategically collect data and modify the models dynamically to incorporate recent observations. This improved modeling capability can then be used in feedback to steer the system towards a desired trajectory or outcome. Such an adaptive, data-driven approach has the potential to significantly improve the modeling capability of complex systems that include streaming data and opens the door to new applications, including smarter, safer vehicles and environments. Examples of such applications include |
2015 |
The 5th International Workshop on Programming based on Actors, Agents, and Decentralized Control (AGERE!), Pittsburgh, PA, USA, October 26, 2015, co-located with SPLASH 2015. This latest edition of AGERE!, an ACM SIGPLAN workshop, confirms its role as a unique venue in the research landscape bringing together researchers and practitioners interested in actors, agents and, more generally, high-level paradigms emphasizing decentralized control in thinking, modeling, developing, and reasoning about software systems. The fundamental turn of software to concurrency and distribution is not only a matter of performance, but also of design and abstraction. It calls for programming paradigms that, compared to current mainstream paradigms, allow us to more naturally think about, design, develop, execute, debug, and profile systems exhibiting different degrees of concurrency, autonomy, decentralization of control, and physical distribution. All stages of software development are considered interesting for the workshop, including requirements, modeling, formalization, prototyping, design, implementation, tooling, testing, and any other means of producing running software based on actors and agents as first-class abstractions. The scope of the workshop includes aspects that concern both the theory and the practice of programming using such paradigms, so as to bring together researchers working on models, languages and technologies, as well as practitioners using such technologies to develop real-world systems and applications |
This paper presents a newly developed implementation of remote message passing, remote actor creation and actor migration in SALSA Lite. The new runtime and protocols are implemented using SALSA Lite’s lightweight actors and asynchronous message passing, and provide significant performance improvements over SALSA version 1.1.5. Actors in SALSA Lite can now be local, the default lightweight actor implementation; remote, actors which can be referenced remotely and send remote messages, but cannot migrate; or mobile, actors that can be remotely referenced, send remote messages and migrate to different locations. Remote message passing in SALSA Lite is twice as fast, actor migration is over 17 times as fast, and remote actor creation is two orders of magnitude faster. Two new benchmarks for remote message passing and migration show this implementation has strong scalability in terms of concurrent actor message passing and migration. The costs of using remote and mobile actors are also investigated. For local message passing, remote actors resulted in no overhead, and mobile actors resulted in 30% overhead. Local creation of remote and mobile actors was more expensive with 54% overhead for remote actors and 438% for mobile actors. In distributed scenarios, creating mobile actors remotely was only 6% slower than creating remote actors remotely, and passing messages between mobile actors on different theaters was only 5.55% slower than passing messages between remote actors. These results highlight the benefits of our approach in implementing the distributed runtime over a core set of efficient lightweight actors, as well as provide insights into the costs of implementing remote message passing and actor mobility. |
Dynamic Data-Driven Avionics Systems (DDDAS) embody ideas from the Dynamic Data-Driven Application Systems paradigm by creating a data-driven feedback loop that analyzes spatio-temporal data streams coming from aircraft sensors and instruments, looks for errors in the data signaling potential failure modes, and corrects for erroneous data when possible. In case of emergency, DDDAS need to provide enough information about the failure to pilots to support their decision making in real-time. We have developed the PILOTS system, which supports data-error tolerant spatio-temporal stream processing, as an initial step to realize the concept of DDDAS. In this paper, we apply the PILOTS system to actual data from the Tuninter 1153 (TU1153) flight accident in August 2005, where the installation of an incorrect fuel sensor led to a fatal accident. The underweight condition suggesting an incorrect fuel indication for TU1153 is successfully detected with 100% accuracy during cruise flight phases. Adding logical redundancy to avionics through a dynamic data-driven approach can significantly improve the safety of flight. |
The device people use to capture multimedia has changed over the years with the rise of smartphones. Smartphones are readily available, easy to use and capture multimedia with high quality. While consumers capture all of this media, the storage requirements are not changing significantly. Therefore, people look towards cloud storage solutions. The typical consumer stores files within a single provider. They want a solution that is quick to access, reliable, and secure. Using multiple providers can reduce cost and improve overall performance. We present a middleware framework called Distributed Indexed Storage in the Cloud (DISC) to improve all aspects a user expects in a cloud provider. The consumer provides files to the middleware, which processes them according to user policies and stores them in the cloud. The process of uploading and downloading is essentially transparent. Uploads and downloads happen simultaneously by distributing a subset of the file across the multiple cloud providers that the middleware deems fit based on policies. Reliability is another important feature of DISC. To improve reliability, we propose a solution that replicates the same subset of the file across different providers. This is beneficial when one provider is unresponsive: the data can be pulled from another provider holding the same subset. Security has great importance when dealing with consumers' data. We inherently gain security when improving reliability. Since the file is distributed using subsets, no single provider has the full file. In our experiments, performance improvements are observed when delivering and retrieving files compared to the standard approach. The results are promising, saving upwards of eight seconds in processing time. With the expansion of more cloud providers, the results are expected to improve. |
The device people use to capture multimedia has changed over the years with the rise of smart phones. Smart phones are readily available, easy to use, and capture multimedia with high quality. While consumers capture all of this media, the storage requirements are not changing significantly. Therefore, people look towards cloud storage solutions. The typical consumer stores files within a single provider. They want a solution that is quick to access, reliable, and secure. Using multiple providers can reduce cost and improve overall performance. We present a middleware framework called Distributed Indexed Storage in the Cloud (DISC) to improve all aspects a user expects in a cloud provider. The process of uploading and downloading is essentially transparent to the user. Uploads and downloads happen simultaneously by distributing a subset of the file across the multiple cloud providers that the middleware deems fit based on policies. Reliability is another important feature of DISC. To improve reliability, we propose a solution that replicates the same subset of the file across different providers. This is beneficial when one provider is unresponsive: the data can be pulled from another provider holding the same subset. Security has great importance when dealing with consumers' data. We inherently gain security when improving reliability. Since the file is distributed using subsets, no single provider has the full file. In our experiment, performance improvements are observed when delivering and retrieving files compared to the standard approach. The results are promising, saving upwards of eight seconds in processing time. With the expansion of more cloud providers, the results are expected to improve. |
To analyze data distributed across the world, one can use distributed computing power to take advantage of data locality and achieve higher throughput. The multi-cloud model, a composition of multiple clouds, can provide cost-effective computing resources to process such distributed data. As multi-cloud environments become more and more accessible to cloud users, the use of MapReduce/Hadoop over multi-cloud is emerging; however, existing work has two issues in principle. First, it mainly focuses on maximizing throughput by improving data locality, but the perspective of cost optimization is missing. Second, conventional centralized optimization methods would not be able to scale well in multi-cloud environments due to their highly dynamic nature. We plan to solve the first issue by formalizing an optimization framework for MapReduce over multi-cloud including virtual machine and data transfer costs, and the second issue by creating decentralized resource management middleware that considers multi-criteria (cost and performance) optimization. This paper reports the progress we have made so far in these two directions. |
Dynamic Data-Driven Avionics Systems (DDDAS) embody ideas from the Dynamic Data-Driven Application Systems paradigm by creating a data-driven feedback loop that analyzes spatiotemporal data streams coming from aircraft sensors and instruments, looks for errors in the data signaling potential failure modes, and corrects for erroneous data when possible. |
2014 |
The 4th International Workshop on Programming based on Actors, Agents, and Decentralized Control (AGERE!) is a unique venue in the research landscape bringing together researchers and practitioners interested in actors, agents and, more generally, high-level paradigms emphasizing decentralized control in thinking, modeling, developing, and reasoning about programs and software systems. The fundamental turn of software into concurrency and distribution is not only a matter of performance, but also of design and abstraction. It calls for programming paradigms that, compared to current mainstream paradigms, would allow us to more naturally think about, design, develop, execute, debug, and profile systems exhibiting different degrees of concurrency, autonomy, decentralization of control, and physical distribution. AGERE! is an ACM SIGPLAN workshop dedicated to focusing on and developing research on programming systems, languages and applications based on actors, agents and any related programming paradigm promoting a decentralized mindset in solving problems and in developing systems to implement such solutions. All stages of software development are considered interesting for the workshop, including requirements, modeling, formalization, prototyping, design, implementation, tooling, testing, and any other means of producing running software based on actors and agents as first-class abstractions. The scope of the workshop includes aspects that concern both the theory and the practice of design and programming using such paradigms, so as to bring together researchers working on models, languages and technologies, as well as practitioners using such technologies to develop real-world systems and applications. AGERE! 2014 received 14 submissions, of which 9 full papers were accepted to be included in these proceedings. Since the first edition, a main objective of the workshop has been to explore and foster the adoption of actors, agents and paradigms based on a decentralized control as a more high-level and effective alternative --- from an abstraction point of view in particular --- to mainstream approaches such as multi-threaded programming. Among others, this calls for devising technologies featuring a good level of maturity and performance. This is reflected by the contributions accepted in this edition. |
As modern computer processors continue becoming more parallel, the actor model plays an increasingly important role in helping develop correct concurrent systems. In this paper, we consider efficient runtime strategies for non-distributed actor programming languages. While the focus is on a non-distributed implementation, it serves as a platform for a future efficient distributed implementation. Actors extend the object model by combining state and behavior with a thread of control, which can significantly simplify concurrent programming. Further, with asynchronous communication, no shared memory, and the fact that an actor only processes one message at a time, it is possible to easily implement transparent distributed message passing and actor mobility. This paper discusses SALSA Lite, a completely re-designed actor runtime system engineered to maximize performance. The new runtime consists of a highly optimized core for lightweight actor creation, message passing, and message processing, which is used to implement more advanced coordination constructs. This new runtime is novel in two ways. First, by default the runtime automatically maps the lightweight actors to threads, allowing the number of threads used by a program to be specified at runtime transparently, without any changes to the code. Further, language constructs allow programmers to have first-class control over how actors are mapped to threads (creating new threads if needed). Second, the runtime directly maps actor garbage collection to object garbage collection, allowing non-distributed SALSA programs to use Java’s garbage collection “for free”. This runtime is shown to have comparable or better performance for basic actor constructs (message passing and actor creation) than other popular actor languages: Erlang, Scala, and Kilim. |
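For readers unfamiliar with the actor model, the following Python sketch shows the core ingredients: a mailbox per actor, asynchronous sends, and one message processed at a time. Unlike SALSA Lite, which multiplexes many lightweight actors over a configurable number of threads, this illustration simply gives each actor its own thread.

```python
# Minimal actor illustration: mailbox + dedicated thread per actor.
import queue, threading, time

class Actor:
    def __init__(self, behavior):
        self.behavior = behavior          # function: (actor, message) -> None
        self.mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):              # asynchronous, non-blocking send
        self.mailbox.put(message)

    def _run(self):
        while True:
            msg = self.mailbox.get()      # one message at a time, no shared state
            self.behavior(self, msg)

def printer(actor, msg):
    print("received:", msg)

if __name__ == "__main__":
    a = Actor(printer)
    a.send("hello")
    a.send("world")
    time.sleep(0.1)                       # let the daemon thread drain the mailbox
```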
The transactor model, an extension to the actor model, specifies an operational semantics to model concurrent systems with globally consistent distributed state. The semantics formally tracks dependencies among loosely coupled distributed components to ensure fault tolerance through a two-phase commit protocol and to issue rollbacks in the presence of failures or state inconsistency. In this paper, we introduce the design of a transactor language as an extension of an existing actor language and highlight the capabilities of this programming model. We developed our transactor language using SALSA, an actor language developed as a dialect of Java. We first develop a basic transactor SALSA/Java library, which implements the fundamental semantics of the transactor model following the operational semantics' transition rules. We then illustrate two example programs written using this library. Furthermore, we introduce a state storage abstraction known as the Uniform Storage Locator, following the Universal Actor Name and Universal Actor Locator abstractions from SALSA, that uses a storage service to maintain checkpointed transactor states. The transactor model guarantees safety but not progress. Therefore, to help develop realistic transactor programs that make progress, we introduce the Consistent Distributed State Protocol and Ping Director that improve upon the Universal Checkpointing Protocol to aid transactor programs in reaching globally consistent distributed states. |
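A compact sketch of the two-phase commit step underlying the transactor semantics described above: a coordinator collects votes from all participants and either commits everywhere or rolls everyone back. The participant API names are hypothetical, not part of the SALSA/Java library.

```python
# Two-phase commit sketch: phase 1 gathers votes, phase 2 commits or rolls back.
class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
        self.state = "tentative"

    def prepare(self):              # phase 1: vote on whether to commit
        return self.will_commit

    def commit(self):               # phase 2a: make tentative state permanent
        self.state = "committed"

    def rollback(self):             # phase 2b: restore the last checkpoint
        self.state = "rolled_back"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False

group = [Participant("a"), Participant("b", will_commit=False)]
print(two_phase_commit(group), [p.state for p in group])
```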
We are developing a hierarchy of theories to reason about actor systems, with the ability to reuse proofs formalized at an abstract level in reasoning about concrete actor programs. Several actor languages, e.g., the ABCL family of languages, implement First-In First-Out (FIFO) communication between actors. Furthermore, many practical systems require FIFO communication for correctness. In previous work, Musser and Varela formalized properties including monotonicity of actor local states, guaranteed message delivery, and general consequences of fairness. While the actor model requires fairness, it does not require FIFO communication. In this thesis, we extend the actor reasoning framework to enable proving correctness of systems which require FIFO communication. This is done by extending the actor framework within the Athena proof system, in which proofs are both human-readable and machine-checkable, taking advantage of its library of algebraic and relational theories. We introduce three new theories into the actor model framework of Athena. All three of these theories are developed at the abstract level, enabling their use in many concrete programs. The first two of these theories introduce sequence numbers into the messages passed between actors, one for sending and one for receiving. We take advantage of the monotonicity of actor transitions to show that send sequence numbers and receive sequence numbers will only ever increase. The third new theory begins to prove the ordering of messages given an order of the sequence numbers. We use results from the first two theories to show that if two messages are about to be sent or received, then the order in which the messages are sent or received is dictated by the sequence numbers. We then use that result to show that two messages must be received in the same order in which they were sent. We continue on to show an example of an actor system, based on the computation of the Sieve of Eratosthenes, that requires FIFO communication in order to be able to prove correctness of its computation. |
Many have leveraged virtual machine cloning, migration and other features of virtualization to improve and guarantee high availability and performance of their applications. Virtual machine malleability is the ability to split and merge virtual machines based on demand and performance, to improve virtualization functionality. The splitting function will take a virtual machine with a workload of N independent processes and split the independent processes into at most N virtual machines. Similarly, the merging function will take at most N previously-split virtual machines and merge their independent processes into a single virtual machine. There has been previous research in malleability at the application layer. However, most of the benefits of application layer malleability can also be realized at the virtual machine layer. These benefits include dynamic workload reconfigurability while affording better transparency. The benefits provided by virtual machine malleability can be put into 3 categories: scalability and elasticity, energy improvements and process scheduling. Traditionally, the approach of application scaling at the application layer involves writing an application with malleability in mind. Using our method of scaling, applications do not need to be written with malleability in mind to be able to scale, as long as the application workloads are independent or they can communicate over a network. The approach of energy optimization typically involves consolidating virtual machines onto fewer hosts and putting idle hosts in a low power or off state. Although this method works quite well, certain virtual machine configurations cannot be consolidated without significantly impacting performance. Using virtual machine malleability, processes running on a virtual machine that is on an overloaded physical machine can be split across multiple virtual machines, each with a smaller resource footprint. Each of these split machines can then be consolidated without causing the physical hosts to become overloaded. Such flexibility in virtual machine scheduling enables better energy consumption. Our experiments with virtual machine malleability show that application scalability is possible with negligible performance degradation, that it is possible to dynamically reconfigure workloads to achieve set resource utilization, which in turn improves energy utilization. We evaluate scalability and elasticity by splitting and merging virtual machines running many instances of the same CPU intensive process. Our evaluation of splitting and merging demonstrates linear performance changes. Finally, we discuss some future work which includes the design of middleware for managing virtual machine malleability. |
The Transactor model, an extension to the Actor Model, is a well-formulated way to model distributed concurrent systems while maintaining a global distributed state. This model provides semantics to track dependencies among loosely-coupled distributed components that ensures fault tolerance through a two-phase commit protocol and issues rollbacks in the presence of failures or state inconsistency. We seek to introduce an implementation of this model through a transactor language as an extension of an existing actor language and highlight the capabilities of this programming model. We developed our transactor language using SALSA, an actor language developed as a dialect of Java. This transactor language will in turn be a dialect of SALSA, which similarly will compile into Java code. We first develop our transactor language as a basic SALSA/Java library in conjunction with the SALSA Java library. This library implements the fundamental semantics of the transactor model following the defined transition rules. We then provide example programs written using this library, such as the reference cell, banking transfer, and the house purchase transaction. Furthermore, we seek to expand our language by introducing a state storage property known as the Universal Storage Location, as an extension of the Universal Actor Name and Universal Actor Locator in SALSA, that leverages a Storage Service to maintain checkpointed transactor states. We also introduce the Consistent Transaction Protocol and Ping Director, which improve upon the Universal Checkpointing Protocol to aid transactor programs in reaching globally consistent states and performing reliable transactions. |
The use of mobile devices has been steadily increasing, for example, smartphones are expected to be used by 69.4% of people worldwide by 2017. User expectations have also increased as the computational capabilities of mobile devices improve. As a result, software applications need to perform ever more complex and data-intensive tasks to address user expectations. Because of resource limitations in mobile devices (e.g., battery, limited network connectivity) we investigate the cloud computing paradigm as a means of augmenting mobile device capabilities. In this thesis, we first study the potential reduction in computation time for a mobile device application that offloads part or all of its execution to remote resources, such as a tablet, a laptop, a desktop, or a private/public cloud. Second, we apply the actor model of concurrent computation to reconfigure a distributed application from the mobile device to the cloud. Specifically, we use the SALSA actor programming language, which allows developers to easily create computationally intensive applications that can be broken apart and migrated to various computational resources. Since SALSA programs compile down to Java byte code, we can readily run them in the Android Operating System through an extension to the SALSA language. Lastly, we aim for a separation of concerns by specifying policies that govern when and where to move actors, separately from the functional application code. Using our Mobile Cloud Computing using Actors (MobileCCA) approach, as applied to a face recognition task, we observed speedups on average of ~5x in the private cloud with respect to doing the computation on the mobile device. Furthermore, we were able to perform the face recognition task on a database of 1000 faces for 400 people, a task beyond the resource capabilities of the mobile device alone. MobileCCA therefore illustrates not only the potential to speedup computations in mobile devices and save battery, but also to enhance the power of mobile applications. |
2013 |
We measure the spatial density of F turnoff stars in the Sagittarius dwarf tidal stream, from Sloan Digital Sky Survey data, using statistical photometric parallax. We find a set of continuous, consistent parameters that describe the leading Sgr stream's position, direction, and width for 15 stripes in the north Galactic cap, and three stripes in the south Galactic cap. We produce a catalog of stars that has the density characteristics of the dominant leading Sgr tidal stream that can be compared with simulations. We find that the width of the leading (north) tidal tail is consistent with recent triaxial and axisymmetric halo model simulations. The density along the stream is roughly consistent with common disruption models in the north, but possibly not in the south. We explore the possibility that one or more of the dominant Sgr streams has been misidentified, and that one or more of the "bifurcated" pieces is the real Sgr tidal tail, but we do not reach definite conclusions. If two dwarf progenitors are assumed, fits to the planes of the dominant and "bifurcated" tidal tails favor an association of the Sgr dwarf spheroidal galaxy with the dominant southern stream and the "bifurcated" stream in the north. In the north Galactic cap, the best fit Hernquist density profile for the smooth component of the stellar halo is oblate, with a flattening parameter q = 0.53, and a scale length of r_0 = 6.73. The southern data for both the tidal debris and the smooth component of the stellar halo do not match the model fits to the north, although the stellar halo is still overwhelmingly oblate. Finally, we verify that we can reproduce the parameter fits on the asynchronous MilkyWay@home volunteer computing platform. |
Cloud computing's pay-per-use model greatly reduces upfront cost and also enables on-demand scalability as service demand grows or shrinks. Hybrid clouds are an attractive option in terms of cost benefit; however, without proper elastic resource management, computational resources could be over-provisioned or under-provisioned, resulting in wasted money or unsatisfied service demand. In this paper, to accomplish accurate performance prediction and cost-optimal resource management for hybrid clouds, we introduce Workload-tailored Elastic Compute Units (WECU) as a measure of computing resources analogous to Amazon EC2's ECUs, but customized for a specific workload. We present a dynamic programming-based scheduling algorithm to select a combination of private and public resources which satisfy a desired throughput. Using a loosely-coupled benchmark, we confirmed WECUs have 24% better runtime prediction ability than ECUs on average. Moreover, simulation results with a real workload distribution of web service requests show that our WECU-based algorithm reduces costs by 8-31% compared to a fixed provisioning approach. |
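The selection step can be pictured as a small dynamic program: find the cheapest multiset of VM types whose combined capacity, measured in workload-tailored units, meets the required throughput. The VM types, capacities, and prices below are invented for illustration, and the code is a sketch of the general idea, not the paper's algorithm.

```python
# Minimum-cost covering of a throughput requirement with heterogeneous VM types.
def cheapest_mix(vm_types, required_units):
    """vm_types: list of (name, capacity_units, hourly_cost)."""
    INF = float("inf")
    best = [0.0] + [INF] * required_units       # best[u] = min cost to cover u units
    choice = [None] * (required_units + 1)
    for u in range(1, required_units + 1):
        for name, cap, cost in vm_types:
            prev = max(0, u - cap)               # extra capacity beyond u is allowed
            if best[prev] + cost < best[u]:
                best[u], choice[u] = best[prev] + cost, name
    mix, u = [], required_units                  # walk the choices back into a VM list
    while u > 0:
        mix.append(choice[u])
        u = max(0, u - next(c for n, c, _ in vm_types if n == mix[-1]))
    return best[required_units], mix

types = [("small", 4, 0.10), ("large", 10, 0.22)]   # (name, capacity units, $/hour)
print(cheapest_mix(types, 23))                       # e.g., one small + two large VMs
```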
Self-healing spatio-temporal data streaming systems enable error detection and data correction based on error signatures. Error signatures are mathematical function patterns with constraints and are used to identify and categorize errors in redundant spatio-temporal data streams. In this paper, we apply these methods to real data from a private Cessna flight and from the Air France AF447 accident in June 2009. For the private Cessna flight, three error scenarios are simulated: pitot tube failure, GPS failure, and simultaneous pitot tube and GPS failures. The error detection accuracy is approximately 93% and the response time to correct data is at most 5 seconds. For the AF447 flight, 162 seconds of available flight data including the pitot tube failure is collected from the accident report. The pitot tube failure of the AF447 flight is successfully detected and corrected 5 seconds after the beginning of the failure. Overall error mode detection accuracy reaches 96.31%. Furthermore, our simulations show that the system never corrects data incorrectly, i.e., all inaccurate mode detections produce either unknown or unrecoverable errors. |
Detecting and recovering from errors in data streams is paramount to developing successful autonomous real-time streaming applications. In this paper, we devise a multi-modal data error detection and recovery architecture to enable automated recovery from data errors in streaming applications based on available redundancy. We formally define error signatures as a way to identify classes of abnormal conditions and mode likelihood vectors as a quantitative discriminator of data stream condition modes. Finally, we design an extension to our own declarative programming language, PILOTS, to include error correction code. We define performance metrics for our approach, and evaluate the impact of monitored data window size and mode likelihood change threshold on the accuracy and responsiveness of our data-driven multi-modal error detection and correction software. Tragic accidents, such as Air France's flight from Rio de Janeiro to Paris in June 2009 that killed all people on board, can be prevented by implementing auto-pilot systems with an airspeed data stream error detection and correction algorithm following the fundamental principles illustrated in this work. |
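A schematic illustration of how error signatures and a mode likelihood vector can work together: each signature is a predicate over a window of residuals between redundant estimates of the same quantity, and the likelihood vector scores how well each signature explains the window. The signatures, residuals, and thresholds are invented for illustration.

```python
# Score each condition mode by the fraction of the residual window its
# signature explains, then pick the most likely mode.
def mode_likelihoods(residual_window, signatures):
    """signatures: dict mode_name -> predicate(residual) -> bool."""
    n = len(residual_window)
    return {mode: sum(sig(e) for e in residual_window) / n
            for mode, sig in signatures.items()}

signatures = {
    "normal":        lambda e: abs(e) < 5.0,   # redundant estimates agree
    "pitot_failure": lambda e: e < -20.0,      # airspeed reads far too low
    "gps_failure":   lambda e: e > 20.0,       # ground-speed source drops out
}

window = [-42.0, -39.5, -41.2, -40.8]          # airspeed minus estimate (knots, made up)
scores = mode_likelihoods(window, signatures)
print(max(scores, key=scores.get), scores)     # -> pitot_failure
```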
The actor model of distributed computing imposes important restrictions on concurrent computations in order to be valid. In particular, an actor language implementation must provide fairness, the property that if a system transition is infinitely often enabled, the transition must eventually happen. Fairness is fundamental to proving progress properties. We show that many properties of actor computation can be expressed and proved at an abstract level, independently of the details of a particular system of actors. As in abstract algebra, we formulate and prove theorems at the most abstract level possible, so that they can be applied at all more refined levels of the theory hierarchy. Our most useful abstract-level theorems concern persistence of actors, conditional persistence of messages, preservation of unique actor identifiers, monotonicity properties of actor local states, guaranteed message delivery, and general consequences of fairness. We apply the general actor theory to a concrete ticker and clock actor system, proving several system-specific properties, including conditional invariants and a progress theorem. We develop our framework within the Athena proof system, in which proofs are both human-readable and machine-checkable, taking advantage of its library of algebraic and relational theories. |
2012 |
We present the Cloud Operating System (COS), a middleware framework to support autonomous workload elasticity and scalability based on application-level migration as a reconfiguration strategy. While other scalable frameworks (e.g., MapReduce or Google App Engine) force application developers to write programs following specific APIs, COS provides scalability in a general-purpose programming framework based on an actor-oriented programming language. When all executing VMs are highly utilized, COS scales a workload up by migrating mobile actors to dynamically created VMs. When VM utilization drops, COS scales the workload down by consolidating actors and terminating idle VMs. Application-level migration is advantageous compared to VM migration, especially in hybrid clouds in which migration costs over the Internet are critical to scaling out workloads. We demonstrate the general-purpose programming approach using a tightly-coupled computation. We compare the performance of autonomous (i.e., COS-driven) versus ideal reconfiguration, as well as the impact of the granularity of reconfiguration, i.e., VM migration versus application-level migration. Our results show promise for future fully automated cloud computing resource management systems that efficiently enable truly elastic and scalable general-purpose workloads. |
In this paper, we describe a programming model to enable reasoning about spatio-temporal data streams. A spatio-temporal data stream is one where each datum is related to a point in space and time. For example, sensors in a plane record airspeeds (va) during a given flight. Similarly, GPS units record an airplane's flight path over the ground including ground speeds (vg) at different locations. An aircraft's airspeed and ground speed are related by the vector equation vg = va + vw, i.e., the ground velocity is the vector sum of the air velocity and the wind velocity, where va and αa are the aircraft airspeed and heading, and vw and αw are the wind speed and direction. Wind speeds and directions are typically forecast in 3,000-foot height intervals over discretely located fix points in 6-12 hour ranges. Modeling the relationship between these spatio-temporal data streams allows us to estimate with high probability the likelihood of sensor failures and consequent erroneous data. Tragic airplane accidents (such as Air France's Flight 447 on June 1st, 2009 killing all 216 passengers and 12 aircrew aboard) could have been avoided by giving pilots better information which can be derived from inferring stochastic knowledge about spatio-temporal data streams. This work is a first step in this direction. |
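A worked illustration of that relationship, assuming wind direction is given as the direction the wind blows toward: predict the ground speed from the airspeed, heading, and wind, and flag a persistent gap with the measured ground speed as a possible sensor failure. The values and the tolerance are illustrative.

```python
# Predict ground speed as the magnitude of the vector sum of air and wind
# velocities, then compare against the GPS-reported ground speed.
import math

def predicted_ground_speed(v_a, alpha_a_deg, v_w, alpha_w_deg):
    a, w = math.radians(alpha_a_deg), math.radians(alpha_w_deg)
    east = v_a * math.sin(a) + v_w * math.sin(w)
    north = v_a * math.cos(a) + v_w * math.cos(w)
    return math.hypot(east, north)

v_g_expected = predicted_ground_speed(v_a=450, alpha_a_deg=90, v_w=50, alpha_w_deg=90)
v_g_measured = 455                          # from the GPS stream (made up)
if abs(v_g_expected - v_g_measured) > 25:   # illustrative tolerance
    print("possible airspeed or ground-speed sensor error")
else:
    print("streams consistent:", v_g_expected, "vs", v_g_measured)
```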
In this paper, we describe the design and implementation of PILOTS, a ProgrammIng Language for spatiO-Temporal data Streaming applications. Using PILOTS, application developers can easily program an application that handles spatio-temporal data streams by writing a high-level declarative program specification. Whereas spatio-temporal data is available with various spatial density and time frequency depending on data sources (e.g., weather forecast data can be given hourly/daily for a vast geographic area, while GPS data can be given every second or at a higher frequency for a specific geographic location), applications often need to process data at a constant frequency. To view such heterogeneous data streams as homogeneous data streams, PILOTS specifically provides first-class support for data selection and interpolation so that applications can get data consistently regardless of the data's original spatio-temporal heterogeneity. To enable reasoning about errors in correlated spatio-temporal data streams, we introduce the notion of error signatures, patterns in output data streams that appear when input data is erroneous. These patterns are produced thanks to a mathematical model that explicitly specifies the redundancy exhibited in the input data. PILOTS applications readily produce error signatures, which can be an important tool to semi-automatically detect data error conditions and enable better decision support systems. As a motivating application, we illustrate a PILOTS program that receives as input data: the airspeed, the ground speed, and the wind speed for a flight. We then compute the error signatures exhibited by failing the airspeed data stream simulating a pitot tube icing scenario (such as the one occurring in Air France flight 447 in June 2009 ultimately killing all people onboard), and by failing the ground speed data stream simulating a GPS constellation shutdown. |
The actor model of distributed computing imposes important restrictions on valid concurrent computations including fairness. We show that many properties of the model can be expressed and proved at an abstract level, independently of the details of a particular system of actors, in a logical framework in which proofs are both human-readable and machine-checkable. The framework, Athena, is briefly overviewed, with emphasis on its support for abstraction and specialization. A key contribution is the conceptual organization we develop: a richly structured hierarchy of formal theories that can be used to represent and reason about actor systems. Within this conceptual framework, we identify and prove a number of useful abstract-level theorems, including persistence of actors, preservation of unique actor identifiers, and general consequences of fairness. We also combine the general actor theory with a concrete ticker and clock actor system to prove several system-specific properties, including a progress theorem that depends on fairness. |
Smartphones have become very popular. While people enjoy various kinds of applications, some computation-intensive applications cannot be run on the smartphones since their computing power and battery life are still limited. We tackle this problem for two categories of applications. Applications in the first category are single-purpose and moderately slow (more than 10 seconds) to process on a single smartphone, but can be processed reasonably quickly by offloading a single module to a single computer (e.g., a face detection application). Applications in the second category are extremely slow (a few minutes to hours) to process on a single smartphone, but their execution time can be dramatically reduced by offloading computationally heavy modules to multiple computers (e.g., a face recognition application). In this category of applications, management of the server-side computation becomes more important since the computation is much more intensive than in the first category. For the first category, we propose a light-weight task offloading method using runtime profiling, which predicts the total processing time based on a simple linear model and offloads a single task to a single server depending on the current performance of the network and server. Since this model’s simplicity greatly reduces the profiling cost at run-time, it enables users to start using an application without pre-computing a performance profile. Using our method, the performance of face detection for an image of 1.2 Mbytes improved from 19 seconds to 4 seconds. For the second category, we present a middleware framework called the Cloud Operating System (COS) as a back-end technology for smartphones. COS implements the notion of virtual machine (VM) malleability to enable cloud computing applications to effectively scale up and down. Through VM malleability, virtual machines can change their granularity by using split and merge operations. We accomplish VM malleability efficiently by using application-level migration as a reconfiguration strategy. Our experiments with a tightly-coupled computation show that a completely application-agnostic automated load balancer performs almost the same as human-driven VM-level migration; however, human-driven application-level migration outperforms (by 14% in our experiments) human-driven VM-level migration. These results are promising for future fully automated cloud computing resource management systems that efficiently enable truly elastic and scalable workloads. |
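As a sketch of the first category's offloading decision, here is a hedged Python example; the bandwidth, server-side time, and the simple transfer-plus-processing prediction are illustrative assumptions, not the paper's exact profiling model:

```python
def predicted_offload_time(data_bytes, bandwidth_bps, server_time_s):
    """Predicted total offload time = transfer time + remote processing time."""
    return 8 * data_bytes / bandwidth_bps + server_time_s

def should_offload(local_time_s, data_bytes, bandwidth_bps, server_time_s):
    """Offload only when the predicted remote path beats local execution."""
    return predicted_offload_time(data_bytes, bandwidth_bps, server_time_s) < local_time_s

# A 1.2 MB image over an assumed 10 Mbps link with ~3 s of server-side detection
# versus ~19 s of on-phone processing: offloading wins.
print(should_offload(19.0, 1_200_000, 10_000_000, 3.0))  # True
```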
2011 |
Metaschedulers co-allocate resources by requesting a fixed number of processors and usage time for each cluster. These static requests, defined by users, limit the initial scheduling and prevent rescheduling of applications to other resource sets. It is also difficult for users to estimate application execution times, especially on heterogeneous environments. To overcome these problems, metaschedulers can use performance predictions for automatic resource selection. This paper proposes a resource co-allocation technique with rescheduling support based on performance predictions for multi-cluster iterative parallel applications. Iterative applications have been used to solve a variety of problems in science and engineering, including, more recently, large-scale computations based on the asynchronous model. We performed experiments using an iterative parallel application, which consists of benchmark multiobjective problems, with both synchronous and asynchronous communication models on Grid’5000. The results show run time predictions with an average error of 7\% and prevention of up to 35\% and 57\% of run time overestimations to support rescheduling for synchronous and asynchronous models, respectively. The performance predictions require no application source code access. One of the main findings is that, as the asynchronous model masks communication and computation, it requires no network information to predict execution times. By using our co-allocation technique, metaschedulers become responsible for run time predictions, process mapping, and application rescheduling, relieving users of these burdensome tasks. |
Effective visualization is critical to developing, analyzing, and optimizing distributed systems. We have developed OverView, a tool for online/offline distributed systems visualization, that enables modular layout mechanisms so that different distributed system high-level programming abstractions, such as actors or processes, can be visualized in intuitive ways. By default, OverView uses a hierarchical concentric layout that distinguishes entities from containers, allowing migration patterns triggered by adaptive middleware to be visualized. In this paper, we develop a force-directed layout strategy that connects entities according to their communication patterns in order to directly exhibit the application communication topologies. In force-directed visualization, entities’ locations are encoded with different colors to illustrate load balancing. We compare these layouts using quantitative metrics, including the communication-to-entity ratio, applied to common distributed application topologies. We conclude that modular visualization is necessary to effectively visualize distributed systems since no one layout is best for all applications. |
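The force-directed layout can be illustrated with a generic relaxation step. The Python sketch below is a standard Fruchterman-Reingold-style iteration using message counts as spring weights; it is not OverView's actual layout code, and the constants, entity names, and edge weights are assumptions:

```python
import math, random

def force_directed_step(pos, comm_counts, k=1.0, step=0.05):
    """One relaxation step: spring-like attraction along communication edges
    (weighted by message counts) plus pairwise repulsion between all entities."""
    disp = {v: [0.0, 0.0] for v in pos}
    for (u, v), w in comm_counts.items():                      # attraction
        dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
        d = math.hypot(dx, dy) or 1e-9
        f = w * d / k
        disp[u][0] += f * dx / d; disp[u][1] += f * dy / d
        disp[v][0] -= f * dx / d; disp[v][1] -= f * dy / d
    for u in pos:                                              # repulsion
        for v in pos:
            if u == v:
                continue
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            disp[u][0] += (k * k / d) * dx / d
            disp[u][1] += (k * k / d) * dy / d
    return {v: (pos[v][0] + step * disp[v][0], pos[v][1] + step * disp[v][1]) for v in pos}

# Entities a<->b exchange far more messages than b<->c, so a and b end up closer.
positions = {v: (random.random(), random.random()) for v in "abc"}
message_counts = {("a", "b"): 100, ("b", "c"): 5}
for _ in range(200):
    positions = force_directed_step(positions, message_counts)
```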
Volunteer computing grids offer super-computing levels of computing power at the relatively low cost of operating a server. In previous work, the authors have shown that it is possible to take traditionally iterative evolutionary algorithms and execute them on volunteer computing grids by performing them asynchronously. The asynchronous implementations dramatically increase scalability and decrease the time taken to converge to a solution. Iterative and asynchronous optimization algorithms implemented using MPI on clusters and supercomputers, and BOINC on volunteer computing grids have been packaged together in a framework for generic distributed optimization (FGDO). This paper presents a new extension to FGDO for an asynchronous Newton method (ANM) for local optimization. ANM is resilient to heterogeneous, faulty and unreliable computing nodes and is extremely scalable. Preliminary results show that it can converge to a local optimum significantly faster than conjugate gradient descent does. |
This work describes research done by the MilkyWay@Home project to use N-Body simulations to model the formation of the Milky Way Galaxy's halo. While there have been previous efforts to use N-Body simulations to perform astronomical modeling, to our knowledge this is the first to use evolutionary algorithms to discover the initial parameters to the N-Body simulations so that they accurately model astronomical data. Performing a single 32,000 body simulation can take up to 200 hours on a typical processor, with an average of 15 hours. As optimizing the input parameters to these N-Body simulations typically takes at least 30,000 or more simulations, this work is made possible by utilizing the computing power of the 35,000 volunteered hosts at the MilkyWay@Home project, which are currently providing around 800 teraFLOPS. This work also describes improvements to an open-source framework for generic distributed optimization (FGDO), which provide more efficient validation in performing these evolutionary algorithms in conjunction with the Berkeley Open Infrastructure for Network Computing (BOINC). |
Applications on smartphones are extremely popular as users can download and install them very easily from a service provider's application repository. Most of the applications are thoroughly tested and verified on a target smartphone platform; however, some applications could be very computationally intensive and overload the smartphone's resource capability. In this paper, we describe a method to predict the total processing time when offloading part of an application from smartphones to nearby servers. In our method, if an application developer can (1) define a basic model of the problem (e.g., f(x) = ax + b) and (2) implement an algorithm to update the model (e.g., the least squares method), the application quickly adjusts the parameters of the model and minimizes the difference between predicted and measured performance adaptively. This accurate prediction helps dynamically determine whether or not it is worth offloading tasks and the expected performance improvement. Since this model's simplicity greatly reduces the time required for profiling the performance of the application at run-time, it enables users to start using an application without pre-computing a performance profile. Our experiments show that our update parameter protocol for the performance prediction functions works sufficiently well for a face detection problem. The protocol requires on average 7.8 trials to update prediction parameters, and the prediction error stays less than 10% for the rest of the trials. By offloading the face detection task to a nearby server for an image of 1.2Mbytes, the performance improved from 19 seconds to 4 seconds. This research opens up the possibility of new applications in real-time smartphone data processing by harnessing nearby computational resources. |
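To illustrate the kind of model update described (a basic model such as f(x) = ax + b refined by least squares), here is a minimal Python sketch; the class name, sample values, and fitting-over-all-samples choice are illustrative assumptions, not the paper's update protocol:

```python
class LinearTimeModel:
    """Fit t = a*x + b by ordinary least squares over observed (x, t) pairs,
    where x is, e.g., input size and t is measured processing time."""
    def __init__(self):
        self.samples = []

    def update(self, x, t):
        self.samples.append((x, t))
        n = len(self.samples)
        sx = sum(x for x, _ in self.samples)
        st = sum(t for _, t in self.samples)
        sxx = sum(x * x for x, _ in self.samples)
        sxt = sum(x * t for x, t in self.samples)
        denom = n * sxx - sx * sx
        self.a = (n * sxt - sx * st) / denom if denom else 0.0
        self.b = (st - self.a * sx) / n

    def predict(self, x):
        return self.a * x + self.b

model = LinearTimeModel()
for x, t in [(0.5, 2.1), (1.0, 3.9), (1.2, 4.4)]:   # (MB, seconds), illustrative
    model.update(x, t)
print(round(model.predict(2.0), 2))                  # extrapolated processing time
```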
Cloud computing brings significant benefits for service providers and users because of its characteristics: \emph{e.g.}, on demand, pay for use, scalable computing. Virtualization management is a critical task to accomplish effective sharing of physical resources and scalability. Existing research focuses on live Virtual Machine (VM) migration as a workload consolidation strategy. However, the impact of other virtual network configuration strategies, such as optimizing the total number of VMs for a given workload, the number of virtual CPUs (vCPUs) per VM, and the memory size of each VM, has been less studied. This paper presents specific performance patterns on different workloads for various virtual network configuration strategies. For loosely coupled CPU-intensive workloads, on an 8-CPU machine, with memory size varying from 512MB to 4096MB and vCPUs ranging from 1 to 16 per VM, 1, 2, 4, 8, and 16 VM configurations have similar running times. The prerequisite of this conclusion is that all 8 physical processors are occupied by vCPUs. For tightly coupled CPU-intensive workloads, the total number of VMs, vCPUs per VM, and memory allocated per VM become critical for performance. We obtained the best performance when the ratio of the total number of vCPUs to processors is 2. Doubling the memory size on each VM, for example from 1024MB to 2048MB, gave us at most a 15% improvement in performance when the ratio of total vCPUs to physical processors is 2. This research will help private cloud administrators decide how to configure virtual resources for given workloads to optimize performance. It will also help public cloud providers know where to place VMs and when to consolidate workloads to be able to turn on/off Physical Machines (PMs), thereby saving energy and associated cost. Finally, it helps cloud service users decide what kind of and how many VM instances to allocate for a given workload and a given budget. |
Cloud computing brings significant benefits for service providers and service users because of its characteristics: e.g., on demand, pay for use, scalable computing. Virtualization management is a critical component to accomplish effective sharing of physical resources and scalability. Existing research focuses on live Virtual Machine (VM) migration as a VM consolidation strategy. However, the impact of other virtual network configuration strategies, such as optimizing the total number of VMs for a given workload, the number of virtual CPUs (vCPUs) per VM, and the memory size of each VM, has been less studied. This thesis presents specific performance patterns on different workloads for various virtual network configuration strategies. We conclude that, for loosely coupled CPU-intensive workloads, memory size and number of vCPUs per VM do not have significant performance effects. On an 8-CPU machine, with memory size varying from 512MB to 4096MB and vCPUs ranging from 1 to 16 per VM, 1, 2, 4, 8, and 16 VM configurations have similar running times. The prerequisite of this conclusion is that all 8 physical processors be occupied by vCPUs. For tightly coupled CPU-intensive workloads, the total number of VMs, vCPUs per VM, and memory allocated per VM become critical for performance. We obtained the best performance when the ratio of the total number of vCPUs to processors is 2. Doubling the memory size on each VM, for example from 1024MB to 2048MB, brings at most a 15% improvement in performance when the number of VMs is greater than 2. Based on the experimental results, we propose a framework and a threshold-based strategy set to dynamically refine virtualization configurations. The framework mainly contains three parts: a resource monitor, a virtual network configuration controller, and a scheduler, which are responsible for monitoring resource usage at both the virtual and physical layers, controlling virtual resource distribution, and scheduling concrete reconfiguration steps, respectively. Our reconfiguration approach consists of four strategies: VM migration and VM malleability strategies, which operate at the global level, and vCPU tuning and memory ballooning, which operate at the local level. The strategies evaluate and trigger specific reconfiguration steps (for example, doubling the number of vCPUs on each VM) by comparing currently allocated resources and corresponding utilizations with expected values. The experimental evaluation of the threshold-based strategy shows that reconfiguration at the global level works better for tightly coupled CPU-intensive workloads than for loosely coupled ones. Local reconfiguration, including dynamically changing the number of vCPUs and the memory size allocated to VMs, improves the performance of initially sub-optimal virtual network configurations, even though it falls short of performing as well as the initially optimal virtual network configurations. This research will help private cloud administrators decide how to configure virtual resources for a given workload to optimize performance. It will also help service providers know where to place VMs and when to consolidate workloads to be able to turn on/off Physical Machines (PMs), thereby saving energy and associated costs. Finally, it lets service users know what kind of and how many VM instances to allocate in a public cloud for a given workload and budget. |
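A hedged Python sketch of a threshold-based reconfiguration rule in the spirit described above; the utilization thresholds, the 2:1 target vCPU-to-CPU ratio for tightly coupled workloads, and the coarse action strings are assumptions drawn from the reported results, not the thesis's actual strategy set:

```python
def suggest_reconfiguration(physical_cpus, total_vcpus, avg_cpu_util,
                            tightly_coupled, low=0.3, high=0.85):
    """Threshold-based sketch: compare utilization and the vCPU:CPU ratio
    against target values and emit a coarse reconfiguration step."""
    target_ratio = 2 if tightly_coupled else 1   # assumption based on the reported best ratio
    ratio = total_vcpus / physical_cpus
    if avg_cpu_util > high and ratio < target_ratio:
        return "scale up: add vCPUs (or VMs) toward a %d:1 vCPU:CPU ratio" % target_ratio
    if avg_cpu_util < low and total_vcpus > physical_cpus:
        return "scale down: consolidate VMs and release idle physical machines"
    return "no change"

print(suggest_reconfiguration(8, 8, 0.95, tightly_coupled=True))
```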
2010 |
Evolutionary algorithms (EAs) require large scale computing resources when tackling real world problems. Such computational requirements derive from inherently complex fitness evaluation functions, large numbers of individuals per generation, and the number of iterations required by EAs to converge to a satisfactory solution. Therefore, any source of computing power can significantly benefit researchers using evolutionary algorithms. We present the use of volunteer computing (VC) as a platform for harnessing the computing resources of commodity machines that are nowadays present in homes, companies, and institutions. Given that desktop machines currently feature significant computing resources (dual cores, gigabytes of memory, gigabit network connections, etc.), VC has become a cost-effective platform for running time-consuming evolutionary algorithms in order to solve complex problems, such as finding substructure in the Milky Way Galaxy, the problem we address in detail in this chapter. |
Computational science is placing new demands on optimization algorithms as the size of data sets and the computational complexity of scientific models continue to increase. As these complex models have many local minima, evolutionary algorithms (EAs) are very useful for quickly finding optimal solutions in these challenging search spaces. In addition to the complex search spaces involved, calculating the objective function can be extremely demanding computationally. Because of this, distributed computation is a necessity. In order to address these computational demands, top-end distributed computing systems are surpassing hundreds of thousands of computing hosts; and as in the case of Internet-based volunteer computing systems, they can also be highly heterogeneous and faulty. This work examines asynchronous strategies for distributed EAs using simulated computing environments. Results show that asynchronous EAs can scale to hundreds of thousands of computing hosts while being highly resilient to heterogeneous and faulty computing environments, something not possible for traditional distributed EAs which require synchronization. The simulation not only provides insight into how asynchronous EAs perform on distributed computing environments with different latencies and heterogeneity, but also serves as a sanity check, because live distributed systems require problems with high computation-to-communication ratios and traditional benchmark problems cannot be used for meaningful analysis due to their short computation times. |
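For readers unfamiliar with the asynchronous formulation, the following Python sketch shows the general pattern: work is generated whenever a worker requests it and results are inserted whenever they are reported, with no generation barrier. The worker pool is collapsed into one sequential loop for brevity, and all operators and parameters are illustrative assumptions rather than the paper's simulation setup:

```python
import random

def async_ea(fitness, dim, pop_size=50, budget=10_000, bounds=(-5.0, 5.0)):
    """Asynchronous EA sketch (minimization): no generations, no barriers, so
    slow or failed hosts never block the rest of the search."""
    rand_genome = lambda: [random.uniform(*bounds) for _ in range(dim)]
    population = [(rand_genome(), None) for _ in range(pop_size)]   # (genome, fitness)
    for _ in range(budget):
        evaluated = [g for g, f in population if f is not None]
        if len(evaluated) >= 2:                                     # "work request"
            p1, p2 = random.sample(evaluated, 2)
            child = [random.choice(pair) for pair in zip(p1, p2)]   # uniform crossover
        else:
            child = rand_genome()
        f = fitness(child)                                          # "result report"
        worst = max(range(pop_size), key=lambda i: float("inf")
                    if population[i][1] is None else population[i][1])
        if population[worst][1] is None or f < population[worst][1]:
            population[worst] = (child, f)                          # keep improvements only
    return min((p for p in population if p[1] is not None), key=lambda p: p[1])

best_genome, best_fitness = async_ea(lambda x: sum(v * v for v in x), dim=4)
```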
Computational science is placing new demands on distributed computing systems as the rate of data acquisition is far outpacing the improvements in processor speed. Evolutionary algorithms provide efficient means of optimizing the increasingly complex models required by different scientific projects, which can have very complex search spaces with many local minima. This work describes different validation strategies used by MilkyWay@Home, a volunteer computing project created to address the extreme computational demands of 3-dimensionally modeling the Milky Way galaxy, which currently consists of over 27,000 highly heterogeneous and volatile computing hosts, which provide a combined computing power of over 1.55 petaflops. The validation strategies presented form a foundation for efficiently validating evolutionary algorithms on unreliable or even partially malicious computing systems, and have significantly reduced the time taken to obtain good fits of MilkyWay@Home’s astronomical models. |
Effective visualization is critical to developing, analyzing, and optimizing distributed systems. We have developed OverView, a tool for online/offline distributed systems visualization, that enables modular layout mechanisms so that different distributed system high-level programming abstractions, such as actors or processes, can be visualized in intuitive ways. By default, OverView uses a hierarchical concentric layout that distinguishes entities from containers, allowing migration patterns triggered by adaptive middleware to be visualized. In this paper, we develop a force-directed layout strategy that connects entities according to their communication patterns in order to directly exhibit the application communication topologies. In force-directed visualization, entities’ locations are encoded with different colors to illustrate load balancing. We compare these layouts using quantitative metrics, including the communication-to-entity ratio, applied to common distributed application topologies. We conclude that modular visualization is necessary to effectively visualize distributed systems since no one layout is best for all applications. |
Recent studies of the Milky Way halo have shown that there are many tidal debris streams and other substructures that can be detected from the spatial distributions of halo stars. We are attempting to describe the spatial structure through maximum likelihood fitting of a smoothly varying component plus a set of additional components that describe the tidal debris. The Sagittarius tidal debris stream in particular is modeled by a set of piecewise linear cylinders with a density that falls off as a Gaussian from the central axis of the cylinder. We show the highest likelihood fit to the density of SDSS F turnoff stars along the Sagittarius stream, and the results of a test of the sensitivity of the likelihood fits to the angle between the stream direction and the angle at which the data is sliced into pieces. |
This paper studies the impact of VM granularity on workload performance in cloud computing environments. We use HPL as a representative tightly coupled computational workload and a web server providing content to customers as a representative loosely coupled network intensive workload. The performance evaluation demonstrates that VM granularity has a significant impact on the performance of the computational workload. On an 8-CPU machine, the performance obtained from utilizing 8 VMs is more than 4 times higher than that given by 4 or 16 VMs for HPL of problem size 4096; whereas on two machines with a total of 12 CPUs, 24 VMs give the best performance for HPL of problem sizes from 256 to 1024. Our results also indicate that the effect of VM granularity on the performance of the web system is not critical. The largest standard deviation of the transaction rates obtained from varying VM granularity is merely 2.89, with a mean value of 21.34. These observations suggest that VM malleability strategies, where VM granularity is changed dynamically, can be used to improve the performance of tightly coupled computational workloads, whereas VM consolidation for energy savings can be more effectively applied to loosely coupled network intensive workloads. |
Large-scale distributed computing applications require concurrent programming models that support modular and compositional software development. The actor model supports the development of independent software components with its asynchronous message-passing communication and state encapsulation properties. Automatic actor garbage collection is necessary for high-level actor-oriented programming, but identifying live actors is not as intuitive and easy as identifying live passive objects in a reference graph. However, a transformation method can turn an actor reference graph into a passive object reference graph, which enables the use of passive object garbage collection algorithms and simplifies the problem of actor garbage collection. In this paper, we formally define potential communication by introducing two binary relations - the may-talk-to and the may-transitively-talk-to relations, which are then used to define the set of live actors. We also devise two vertex-preserving transformation methods to transform an actor reference graph into a passive object reference graph. We provide correctness proofs for the proposed algorithms. The experimental results also show that the proposed algorithms are efficient. |
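As a rough illustration only, the Python sketch below treats actor liveness as graph reachability from the root set and from unblocked actors. This is a conservative approximation chosen for brevity; it is not the paper's may-talk-to/may-transitively-talk-to definitions or its vertex-preserving transformations, and the graph and names are hypothetical:

```python
def live_actors(refs, roots, unblocked):
    """Crude reachability approximation of actor liveness: actors reachable
    from the root set or from unblocked actors may still receive messages,
    so they are conservatively treated as live.
    refs: dict mapping each actor to the set of actors it references."""
    frontier = list(set(roots) | set(unblocked))
    live = set(frontier)
    while frontier:
        a = frontier.pop()
        for b in refs.get(a, ()):   # following a reference models a potential message send
            if b not in live:
                live.add(b)
                frontier.append(b)
    return live

refs = {"root": {"a"}, "a": {"b"}, "b": set(), "c": {"root"}, "d": set()}
print(live_actors(refs, roots={"root"}, unblocked={"c"}))
# {'root', 'a', 'b', 'c'}  -- 'd' is garbage under this approximation
```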
2009 |
As the rates of data acquisition and cost of model evaluation in scientific computing are far surpassing improvements in processor speed, the size of the computing environments required to effectively perform scientific research is increasing dramatically. As these computing environments increase in size, traditional global optimization methods, which are sequential in nature, fail to adequately address the challenges of scalability, fault tolerance and heterogeneity that using these computing systems entails. This thesis introduces asynchronous optimization strategies which, while similar to their traditional synchronous counterparts, do not have explicit iterations or dependencies. This allows them to scale to hundreds of thousands of hosts while not being degraded by faults or heterogeneity. A framework for generic distributed optimization (FGDO) is presented, which separates the concerns of scientific model development, distributed computing and developing efficient optimization strategies, allowing researchers to develop these independently and utilize them interoperably through simple interfaces. FGDO has been used to run these asynchronous optimization methods using an astroinformatics problem which calculates models of the Milky Way galaxy on thousands of processors in RPI’s BlueGene/L supercomputer and to run the MilkyWay@Home volunteer computing project, which currently consists of over 25,000 active computing hosts. A simulation environment was also implemented in FGDO, which allowed asynchronous optimization to be examined in a controlled setting with benchmark optimization problems. Results using the simulated environment show that the asynchronous optimization methods used scale to hundreds of thousands of computing hosts, while the traditional methods do not improve, or even degrade, as more computing hosts are added. Additionally, the asynchronous optimization methods are shown to be largely unaffected by increasing heterogeneity in the computing environment and also scale similarly in a computing environment modeled after MilkyWay@Home. This thesis presents strong evidence of the need for novel optimization methods for massive scale computing systems and provides effective initial work towards this goal. |
Malleability enables a parallel application's execution system to split or merge processes modifying granularity. While process migration is widely used to adapt applications to dynamic execution environments, it is limited by the granularity of the application's processes. Malleability empowers process migration by allowing the application's processes to expand or shrink following the availability of resources. We have implemented malleability as an extension to the process checkpointing and migration (PCM) library, a user-level library for iterative message passing interface (MPI) applications. PCM is integrated with the Internet Operating System, a framework for middleware-driven dynamic application reconfiguration. Our approach requires minimal code modifications and enables transparent middleware-triggered reconfiguration. Experimental results using a two-dimensional data parallel program that has a regular communication structure demonstrate the usefulness of malleability. |
Modern surveys are producing enormous amounts of data that can only be navigated via the use of the ever increasing computational resources available. For example, the SDSS has taken a large amount of photometric data that can be used to discover and study substructure in the Galactic spheroid. A maximum likelihood method was developed and applied to color-selected F turnoff stars from two stripes of SDSS data, to determine the spatial characteristics of the Sagittarius dwarf tidal debris that exists within these stripes. The Sagittarius tidal debris in stripes 79 and 86 were detected at the positions (l,b,R) = (163.311 °, -48.400 °, 30.23 kpc) and (l,b,R) = (34.775 °, -72.342 °, 26.08 kpc) and were found to have a FWHM of 6.53±0.54 kpc and 5.71±0.26 kpc and also to contain ≈9,500 and ≈16,700 F turnoff stars, respectively. The debris pieces are axially aligned with the directions (^X,^Y,^Z) = (0.758 kpc, 0.254 kpc, -0.600 kpc) and (^X,^Y,^Z) = (0.982 kpc, 0.084 kpc, 0.167 kpc), respectively. The results of probabilistically separating the tidal debris from the stellar spheroid are also presented. |
General-Purpose computing on Graphics Processing Units (GPGPU) is an emerging field of research which allows software developers to utilize the significant amount of computing resources GPUs provide for a wider range of applications. While traditional high performance computing environments such as clusters, grids and supercomputers require significant architectural modifications to incorporate GPUs, volunteer computing grids already have these resources available as most personal computers have GPUs available for recreational use. Additionally, volunteer computing grids are gradually upgraded by the volunteers as they upgrade their hardware, whereas clusters, grids and supercomputers are typically upgraded only when replaced by newer hardware. As such, MilkyWay@Home’s volunteer computing system is an excellent testbed for measuring the potential of large scale distributed GPGPU computing across a large number of heterogeneous GPUs. This work discusses the implementation and optimization of the MilkyWay@Home client application for both Nvidia and ATI GPUs. A 17 times speedup was achieved for double-precision calculations on a Nvidia GeForce GTX 285 card, and a 109 times speedup for double-precision calculations on an ATI HD5870 card, compared to the CPU version running on one core of a 3.0 GHz AMD Phenom(tm)II X4 940. Single-precision calculations were also evaluated, which further increased performance by 6.2 times on the ATI card and 7.8 times on the Nvidia card, but with some loss of accuracy. Modifications to the BOINC infrastructure which enable GPU discovery and utilization are also discussed. The resulting software enabled MilkyWay@Home to use GPU applications for a significant increase in computing power, at the time of this publication approximately 216 teraflops, which would place the combined power of these GPUs between the 11th and 12th fastest supercomputers in the world. |
In this paper, we present a framework that enables scientists to steer computations executing over large-scale grid computing environments. By using computational steering, users can dynamically control their simulations or computations to reach expected results more efficiently. The framework supports steerable applications by introducing an asynchronous iterative MapReduce programming model that is deployed using Hadoop over a set of virtual machines executing on a multi-cluster grid. To tolerate the heterogeneity between different sites, results are collected asynchronously and users can dynamically interact with their computations to adjust the area of interest. According to users' dynamic interaction, the framework can redistribute the computational overload between the heterogeneous sites and explore the user's interest area by using more powerful sites when possible. With our framework, the bottleneck induced by synchronisation between different sites is considerably avoided, and therefore the response to users' interaction is satisfied more efficiently. We illustrate and evaluate this framework with a scientific application that aims to fit models of the Milky Way galaxy structure to stars observed by the Sloan Digital Sky Survey. |
Volunteer computing grids offer significant computing power at relatively low cost to researchers, while at the same time generating public interest in different scientific projects. However, in order to be used effectively, their heterogeneity, volatility and restrictive computing models must be overcome. As these computing grids are open, incorrect or malicious results must also be handled. This paper examines extending the BOINC volunteer computing framework to allow for asynchronous global optimization as applied to scientific computing problems. The asynchronous optimization method used is resilient to faults and the heterogeneous nature of volunteer computing grids, while allowing scalability to tens of thousands of hosts. A work verification strategy that does not require the validation of every result is presented. This is shown to effectively reduce the verification needed to less than 30% of the reported results, without degrading the performance of the asynchronous search methods. An asynchronous version of particle swarm optimization (APSO) is presented and compared to previously used asynchronous genetic search (AGS) using the MilkyWay@Home BOINC computing project. Both search methods are shown to scale to MilkyWay@Home's current user base, over 75,000 heterogeneous and volatile hosts, something not possible for traditional optimization methods. APSO is shown to provide faster convergence to optimal results while being less sensitive to its search parameters. The verification strategy presented is shown to be effective for both AGS and APSO. |
2008 |
We present a maximum likelihood method for determining the spatial properties of tidal debris and of the Galactic spheroid. With this method we characterize Sagittarius debris using stars with the colors of blue F turnoff stars in SDSS stripe 82. The debris is located at $(\alpha, \delta, R) = (31.37^{\circ} \pm 0.26^{\circ}, 0.0^{\circ}, 29.22 \pm 0.20\ \mathrm{kpc})$, with a (spatial) direction given by the unit vector $(-0.991 \pm 0.007\ \mathrm{kpc}, 0.042 \pm 0.033\ \mathrm{kpc}, 0.127 \pm 0.046\ \mathrm{kpc})$, in galactocentric Cartesian coordinates, and with $\mathrm{FWHM} = 6.74 \pm 0.06$ kpc. This 2.5° wide stripe contains 0.9% as many F turnoff stars as the current Sagittarius dwarf galaxy. Over small spatial extent, the debris is modeled as a cylinder with a density that falls off as a Gaussian with distance from the axis, while the smooth component of the spheroid is modeled with a Hernquist profile. We assume that the absolute magnitude of F turnoff stars is distributed as a Gaussian, which is an improvement over previous methods which fixed the absolute magnitude at $\overline{M}_{g_{0}} = 4.2$. The effectiveness and correctness of the algorithm is demonstrated on a simulated set of F turnoff stars created to mimic SDSS stripe 82 data, which shows that we have a much greater accuracy than previous studies. Our algorithm can be applied to divide the stellar data into two catalogs: one which fits the stream density profile and one with the characteristics of the spheroid. This allows us to effectively separate tidal debris from the spheroid population, both facilitating the study of the tidal stream dynamics and providing a test of whether a smooth spheroidal population exists. |
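For concreteness, the density models described above can be written schematically as follows; this is a sketch consistent with the abstract, and the exact parameterizations, normalizations, and the FWHM-to-width conversion are given in the paper:

```latex
% Tidal stream: cylindrical density falling off as a Gaussian with
% perpendicular distance d from the stream axis (width parameter \sigma):
\rho_{\mathrm{stream}}(d) \;\propto\; \exp\!\left(-\frac{d^{2}}{2\sigma^{2}}\right)

% Smooth spheroid: Hernquist profile with scale radius r_{0}:
\rho_{\mathrm{sph}}(r) \;\propto\; \frac{1}{\dfrac{r}{r_{0}}\left(1+\dfrac{r}{r_{0}}\right)^{3}}

% Absolute magnitude of F turnoff stars: Gaussian rather than fixed:
M_{g_{0}} \;\sim\; \mathcal{N}\!\left(\overline{M}_{g_{0}},\,\sigma_{M}^{2}\right)
```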
Large scale surveys are providing vast amounts of data that can help us understand and study tidal debris more easily and accurately. A maximum likelihood method for determining the spatial properties of this tidal debris and the stellar Galactic spheroid has been developed to take advantage of these huge datasets. We present the results of studying the Sagittarius dwarf tidal stream in two SDSS stripes taken in the southern Galactic Cap using this method. This study was done using stars with the colors of blue F turnoff stars in SDSS. We detected Sagittarius debris at the positions (l,b,R) = (163.311°, −48.400°, 30.23 kpc) and (l,b,R) = (34.775°, −72.342°, 26.08 kpc). These debris pieces were found to have a FWHM of 6.53±0.54 kpc and 5.71±0.26 kpc and also to contain ≈9,500 and ≈16,700 F turnoff stars, respectively. The debris pieces were also found to have (spatial) directions of (^X,^Y,^Z) = (0.758, 0.254, −0.600) and (^X,^Y,^Z) = (0.982, 0.945, 0.167), respectively. Using the results of the algorithm, we have also probabilistically separated the tidal debris from the stellar spheroid and present those results as well. |
This paper examines the use of a probabilistic simplex operator for asynchronous genetic search on the BOINC volunteer computing framework. This algorithm is used to optimize a computationally intensive function with a continuous parameter space: finding the optimal fit of an astronomical model of the Milky Way galaxy to observed stars. The asynchronous search using a BOINC community of over 1,000 users is shown to be comparable to a synchronous continuously updated genetic search on a 1,024 processor partition of an IBM BlueGene/L supercomputer. The probabilistic simplex operator is also shown to be highly effective and the results demonstrate that increasing the parents used to generate offspring improves the convergence rate of the search. Additionally, it is shown that there is potential for improvement by refining the range of the probabilistic operator, adding more parents, and generating offspring differently for volunteered computers based on their typical speed in reporting results. The results provide a compelling argument for the use of asynchronous genetic search and volunteer computing environments, such as BOINC, for computationally intensive optimization problems and, therefore, this work opens up interesting areas of future research into asynchronous optimization methods. |
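One plausible reading of a probabilistic simplex-style operator, sketched in Python for illustration only: an offspring is sampled along the line through the worst of n parents and the centroid of the remaining parents, with a randomly drawn coefficient. The coefficient range, the minimization convention, and the example values are assumptions; the paper defines the operator precisely and studies the effect of the number of parents:

```python
import random

def probabilistic_simplex_offspring(parents, fitnesses, lo=-1.0, hi=2.0):
    """Sketch of a probabilistic simplex (reflection-like) operator: move from
    the worst parent toward/through the centroid of the others by a random
    factor drawn from (lo, hi). Lower fitness is better here."""
    worst = max(range(len(parents)), key=lambda i: fitnesses[i])
    others = [p for i, p in enumerate(parents) if i != worst]
    centroid = [sum(coords) / len(others) for coords in zip(*others)]
    r = random.uniform(lo, hi)
    return [w + r * (c - w) for w, c in zip(parents[worst], centroid)]

parents = [[1.0, 2.0], [2.0, 1.0], [4.0, 4.0]]
fitnesses = [5.0, 5.0, 32.0]          # e.g. sum of squares; third parent is worst
child = probabilistic_simplex_offspring(parents, fitnesses)
```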
Use of large-scale heterogeneous computing environments such as computational grids and the Internet has become of high interest to scientific researchers. This is because the increasing complexity of their scientific models and data sets is drastically outpacing the increases in processor speed while the cost of supercomputing environments remains relatively high. However, the heterogeneity and unreliability of these environments, especially the Internet, make scalable and fault tolerant search methods indispensable to effective scientific model verification. The paper introduces two versions of asynchronous master-worker genetic search and evaluates their convergence and performance rates in comparison to traditional synchronous genetic search on both an IBM BlueGene supercomputer and the MilkyWay@Home BOINC Internet computing project. The asynchronous searches not only perform faster on heterogeneous grid environments as compared to synchronous search, but also achieve better convergence rates for the astronomy model used as the driving application, providing a strong argument for their use on grid computing environments and by the MilkyWay@Home BOINC Internet computing project. |
Distributed systems, due to their inherent complexity and nondeterministic nature, are programmed using high-level abstractions, such as processes, actors, ambients, agents, or services. There is a need to provide tools which allow developers to better understand, test, and debug distributed systems. OverView is a software toolkit which allows online and offline visualization of distributed systems through the concepts of entities and containers, which preserve the abstractions used at the programming level and display important dynamic properties, such as temporal (that is, when entities are created and deleted), spatial (that is, entity location and migration events) and relational (that is, entity containment or communication patterns). In this paper, we introduce two general layout mechanisms to visualize distributed systems: a hierarchical concentric layout that places containers and entities in a ring of rings, and an organic layout that uses the dynamic properties of the system to co-locate entities. We define visualization quality metrics such as intuitiveness, scalability, and genericity, and use them to evaluate the visualization layouts for several application communication topologies including linked lists, trees, hypercubes, and topologies arising from structured overlay networks such as Chord rings. |
The transactor model is an extension of the actor model designed to tolerate failures in distributed systems. Transactors can provide guarantees about consistency of a distributed system's state in the face of message loss and temporary failures of computing nodes. The model introduces dependency information and a two-phase checkpointing protocol. The added dependency information enables transactors to track the interdependencies caused by communications between actors, making it possible to ensure that the state of the distributed system remains globally consistent. Although, at any given time, the distributed system may contain machines which are not consistent with one another, the transactor model keeps track of the interdependencies between these machines, ensuring that such machines will roll back to a previous state if necessary in order to maintain consistency. Thus, the system will move from globally consistent state to globally consistent state. |
2007 |
Advances in hardware technologies are constantly pushing the limits of processing, networking, and storage resources, yet there are always applications whose computational demands exceed even the fastest technologies available. It has become critical to look into ways to efficiently aggregate distributed resources to benefit a single application. Achieving this vision requires the ability to run applications on dynamic and heterogeneous environments such as grids and shared clusters. New challenges emerge in such environments, where performance variability is the rule and not the exception, and where the availability of the resources can change anytime. Therefore, applications require the ability to dynamically reconfigure to adjust to the dynamics of the underlying resources. To realize this vision, we have developed the Internet Operating System (IOS), a framework for middleware-driven application reconfiguration in dynamic execution environments. Its goal is to provide high performance to individual applications in dynamic settings and to provide the necessary tools to facilitate the way in which scientific and engineering applications interact with dynamic environments and reconfigure themselves as needed. Reconfiguration in IOS is triggered by a set of decentralized agents that form a virtual network topology. IOS is built modularly to allow the use of different algorithms for agents’ coordination, resource profiling, and reconfiguration. IOS exposes generic APIs to high-level applications to allow for interoperability with a wide range of applications. We investigated two representative virtual topologies for inter-agent coordination: a peer-to-peer and a cluster-to-cluster topology. |
Iterative applications are known to run as slow as their slowest computational component. This paper introduces malleability, a new dynamic reconfiguration strategy to overcome this limitation. Malleability is the ability to dynamically change the data size and number of computational entities in an application. Malleability can be used by middleware to autonomously reconfigure an application in response to dynamic changes in resource availability in an architecture-aware manner, allowing applications to optimize the use of multiple processors and diverse memory hierarchies in heterogeneous environments. The modular Internet Operating System (IOS) was extended to reconfigure applications autonomously using malleability. Two different iterative applications were made malleable. The first, used in astronomical modeling and representative of maximum-likelihood applications, was made malleable in the SALSA programming language. The second models the diffusion of heat over a two-dimensional object, and is representative of applications such as partial differential equations and some types of distributed simulations. Versions of the heat application were made malleable both in SALSA and MPI. Algorithms for concurrent data redistribution are given for each type of application. Results show that using malleability for reconfiguration is 10 to 100 times faster on the tested environments. The algorithms are also shown to be highly scalable with respect to the quantity of data involved. While previous work has shown the utility of dynamically reconfigurable applications using only computational component migration, malleability is shown to provide up to a 15% speedup over component migration alone on a dynamic cluster environment. This work is part of an ongoing research effort to enable applications to be highly reconfigurable and autonomously modifiable by middleware in order to efficiently utilize distributed environments. Grid computing environments are becoming increasingly heterogeneous and dynamic, placing new demands on applications’ adaptive behavior. This work shows that malleability is a key aspect in enabling effective dynamic reconfiguration of iterative applications in these environments. |
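To illustrate the kind of data redistribution that malleability requires for a row-distributed grid such as the heat application's, here is a hedged Python sketch; the block layout, helper names, and example sizes are illustrative assumptions, not the SALSA/MPI implementation:

```python
def partition_rows(total_rows, num_components):
    """Contiguous block partition of grid rows across components."""
    base, extra = divmod(total_rows, num_components)
    sizes = [base + (1 if i < extra else 0) for i in range(num_components)]
    starts = [sum(sizes[:i]) for i in range(num_components)]
    return list(zip(starts, sizes))

def redistribute(blocks, new_n):
    """Reshape data held by the current components into new_n components:
    conceptually, split/merge operations moving contiguous row ranges."""
    flat = [row for block in blocks for row in block]
    return [flat[s:s + n] for s, n in partition_rows(len(flat), new_n)]

# A 10x4 grid initially split across 2 components; grow to 3 (split), shrink to 2 (merge).
blocks = [[[i] * 4 for i in range(5)], [[i] * 4 for i in range(5, 10)]]
blocks = redistribute(blocks, 3)   # row counts become 4, 3, 3
blocks = redistribute(blocks, 2)   # row counts become 5, 5
```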
This paper presents GMLE, a generic and distributed framework for maximum likelihood evaluation. GMLE is currently being applied to astroinformatics for determining the shape of star streams in the Milky Way galaxy, and to particle physics in a search for theory-predicted but yet unobserved sub-atomic particles. GMLE is designed to enable parallel and distributed executions on platforms ranging from supercomputers and high-performance homogeneous computing clusters to more heterogeneous Grid and Internet computing environments. GMLE's modular implementation separates developers' concerns into distributed evaluation frameworks, scientific models, and search methods, which interact through a simple API. This allows us to compare the benefits and drawbacks of different scientific models using different search methods on different computing environments. We describe and compare the performance of two implementations of the GMLE framework: an MPI version that more effectively uses homogeneous environments such as IBM's BlueGene, and a SALSA version that more easily accommodates heterogeneous environments such as the Rensselaer Grid. We have shown GMLE to scale well in terms of computation as well as communication over a wide range of environments. We expect that scientific computing frameworks, such as GMLE, will help bridge the gap between scientists needing to analyze ever larger amounts of data and ever more complex distributed computing environments. |
Malleability enables a parallel application's execution system to split or merge processes modifying granularity. While process migration is widely used to adapt applications to dynamic execution environments, it is limited by the granularity of the application's processes. Malleability empowers process migration by allowing the application's processes to expand or shrink following the availability of resources. We have implemented malleability as an extension to the PCM (process checkpointing and migration) library, a user-level library for iterative MPI applications. PCM is integrated with the Internet operating system (IOS), a framework for middleware-driven dynamic application reconfiguration. Our approach requires minimal code modifications and enables transparent middleware- triggered reconfiguration. Experimental results using a two-dimensional data parallel program that has a regular communication structure demonstrate the usefulness of malleability. |
Research scientists increasingly turn to large-scale heterogeneous environments such as computational grids and Internet-based facilities to satisfy their rapidly growing computational needs. The increasing complexity of scientific models and the rapid collection of new data are drastically outpacing the advances in processor speed while the cost of supercomputing environments remains relatively high. However, the heterogeneity and unreliability of these environments, especially the Internet, make scalable and fault tolerant search methods indispensable to effective scientific model verification. An effective search method for these types of environments is asynchronous genetic search, where a population continuously evolves based on asynchronously generated and received results. However, it is unclear what effect heterogeneity has on this type of search. For example, results received from slower workers may turn out to be obsolete or less beneficial than results calculated by faster workers. This paper examines the effect of heterogeneity on asynchronous panmictic (single population) genetic search for two different scientific applications, one used by astronomers to model the Milky Way galaxy and another by particle physicists to determine the existence of theory-predicted, yet unobserved particles such as missing baryons. Results show that, for both applications, results received from slower workers, while overall less beneficial, are still useful. Additionally, a modification of asynchronous genetic search shows that different parameter generation strategies change their effectiveness over the course of the search. |
With the emergence of Internet and mobile computing, a wide range of Internet applications have introduced new demands for openness, portability, highly dynamic reconfiguration, and the ability to adapt quickly to changing execution environments. Current programming languages and systems lack support for dynamic reconfiguration of applications, where application entities get moved to different processing nodes at run-time. Java has provided support for dynamic web content through applets, network class loading, bytecode verification, security, and multi-platform compatibility. Moreover, Java is a good framework for distributed Internet programming because of its standardized representation of objects and serialization support. Some of the important libraries that provide support for Internet computing are: java.rmi for remote method invocation, java.reflection for run-time introspection, java.io for serialization, and java.net for sockets, datagrams, and URLs. SALSA (Simple Actor Language, System and Architecture) is an actor-oriented programming language designed and implemented to introduce the benefits of the actor model while keeping the advantages of object-oriented programming. Abstractions include active objects, asynchronous message passing, universal naming, migration, and advanced coordination constructs for concurrency. SALSA is pre-processed into Java and preserves many of Java’s useful object-oriented concepts, mainly encapsulation, inheritance, and polymorphism. SALSA abstractions enable the development of dynamically reconfigurable applications. A SALSA program consists of universal actors that can be migrated around distributed nodes at runtime. |
Automatic garbage collection (GC) gives abstraction to distributed application development, promoting code quality and improving resource management. Unreachability of active objects or actors from the root set is not a sufficient condition to collect actor garbage, making passive object GC algorithms unsafe when directly used on actor systems. In practical actor languages, all actors have references to the root set since they can interact with users, e.g., through standard input or output streams. This feature makes every unblocked actor live, and thus we call it the live unblocked actor principle. Following this idea, we introduce pseudo-roots: a dynamic set of actors that can be viewed as the root set. Pseudo-roots use protected (undeletable) references to ensure that no actors are erroneously collected even with messages in transit. The pseudo-root approach simplifies distributed actor garbage collection by representing in-transit messages in the actor reference graph even if both application and system messages are unordered (non-FIFO) and communication is asynchronous. Our algorithm also supports systems not following the live unblocked actor principle and, therefore, is applicable to more traditional actor garbage collection. We formalize the computing model of the pseudo-root approach and provide proofs of correctness. Furthermore, we summarize empirical results on the performance and scalability of distributed GC using a particle physics maximum likelihood evaluation fitter on a 72-processor cluster environment. |
As distributed computing environments increase in size by adding more participants, their computing architectures become more heterogeneous and dynamically available, while the chance of hardware failure increases. In order to efficiently develop applications for these environments, new programming methods and services must be made accessible which enable dynamic reconfiguration and fault tolerance. As part of an effort to provide these services and methods, this thesis examines two reconfiguration methods that allow applications to dynamically respond to changes in their environment. Migration allows for the computational components of an application to move between processors, while malleability enables a more general reshaping of an application by adding or removing computational components and redistributing data. Migration and malleability have been implemented in the SALSA programming language. Migration is provided transparently without any effort required from a developer, while malleability is accomplished through a simple API. The Internet Operating System (IOS), a modular middleware, is used to autonomously reconfigure applications which implement these reconfiguration methods. |
2006 |
Distributed actor garbage collection (GC) is a notoriously hard problem due to the nature of distributed actor systems — no common clock, no coherent global information, asynchronous and unordered message passing, autonomous behavior of actors, and counter-intuitive actor marking to identify live actors. Most existing distributed actor GC algorithms rely on First-In-First-Out (FIFO) communication, which constrains the actor model and is impractical with actor mobility; others depend on stop-the-world synchronization, which is intrusive and impractical for users’ computations. Existing actor GC algorithms ignore actor mobility and resource access restrictions. Existing distributed passive object GC algorithms cannot be directly reused because of the different semantics of passive objects and actors. To overcome the problems that existing algorithms cannot solve, this thesis presents a practical actor GC mechanism for distributed mobile actor systems. Our approach starts by formalizing garbage actors, and then we show two different but similar transformation methods that prove the equivalence of actor GC and passive object GC. Two actor marking algorithms are derived from the transformation methods — the back pointer algorithm and the N-color algorithm. The back pointer algorithm has linear time complexity of O(E + V) and extra space complexity of O(E + V), and the N-color algorithm has time complexity of O(E lg∗ M) and extra space complexity of O(M), where E is the number of references, V the number of actors, and M the number of unblocked and root actors. The N-color algorithm only requires scanning the reference graph once, while the back pointer algorithm requires scanning it twice. The thesis follows by describing our distributed mobile actor GC mechanism. It consists of 1) an asynchronous, non-FIFO reference listing based algorithm which supports hierarchical GC (local and global GC), 2) a new fault-tolerant, distributed snapshot-based algorithm which collects cyclic and acyclic garbage in a partial set of computing hosts, and 3) formal models and correctness proofs. Experimental results have confirmed that our approach is practical and scalable. |
Large-scale, dynamic, and heterogeneous networks of computational resources (a.k.a. grids) promise to provide high performance and scalability to computationally intensive applications. To fulfill this promise, grid environments require complex resource management. We propose decentralized middleware-triggered dynamic reconfiguration strategies to enable application adaptation to the constantly changing resource availability of Internet-scale shared computational grids. As a proof of concept, we present a software framework for dynamically reconfigurable distributed applications. The Internet Operating System (IOS) is a middleware infrastructure which aims at freeing application developers from dealing with non-functional concerns while seeking to optimize application performance and global resource utilization. IOS consists of distributed middleware agents that are capable of interconnecting themselves in various virtual peer-to-peer topologies. IOS middleware agents: 1) profile application communication patterns; 2) evaluate the dynamics of the underlying physical resources; and 3) reconfigure application components by changing their mappings to physical resources through migration and by changing their granularity through a split and merge mechanism. A key characteristic of IOS is its decentralized coordination, thereby avoiding the use of global knowledge and thus enabling scalable reconfiguration. The IOS middleware is programming model-independent: we have implemented an actor programming model interface for SALSA programs and also a process programming model interface for MPI programs. Experimental results show that adaptive middleware can be an effective approach to reconfiguring distributed applications with various ratios of communication to computation in order to improve their performance, and more effectively utilize grid resources. |
Modern parallel scientific computation is being performed in a wide variety of computational environments that include clusters, large-scale supercomputers, grid environments, and metacomputing environments. This presents challenges for application and library developers, who must develop architecture-aware software if they wish to utilize several computing platforms efficiently. Architecture-aware computation can be beneficial in single-processor environments. It takes the form of something as common as an optimizing compiler, which will optimize software for a target computer. Application and library developers may adjust data structures or memory management techniques to improve cache utilization on a particular system. |
This paper describes a modular decentralized middleware framework for dynamic application reconfiguration in large scale heterogeneous environments. Component malleability is presented as a dynamic reconfiguration strategy. Dynamic component granularity enables improved load balancing by component migration, and most importantly, it enables applications to scale up to arbitrarily large numbers of processing nodes in a relatively transparent way. Preliminary experimental results show that, without significant overhead, malleable components can improve application performance and distributed resource utilization. |
Today’s enterprise data centers support thousands of mission-critical business applications composed of multiple distributed heterogeneous components. Application components exhibit complex dependencies on the configuration of multiple data center network, middleware, and related application resources. Applications are also associated with extended life-cycles, migrating from development to testing, staging and production environments, with frequent roll-backs. Maintaining end-to-end data center operational integrity and quality requires careful planning of (1) application deployment design, (2) resource selection, (3) provisioning operation selection, parameterization and ordering, and (4) provisioning operation execution. Current data center management products are focused on workflow-based automation of the deployment processes. Workflows are of limited value because they hard-code many aspects of the process, and are thus sensitive to topology changes. An emerging and promising class of model-based tools is providing new methods for designing detailed deployment topologies based on a set of requirements and constraints. In this paper we describe an approach to bridging the gap between generated “desired state” models and the elemental procedural provisioning operations supported by data center resources. In our approach, we represent the current and desired state of the data center using object models. We use AI planning to automatically generate workflows that bring the data center from its current state to the desired state. We discuss our optimizations to Partial Order Planning algorithms for the provisioning domain. We validated our approach by developing and integrating a prototype with a state-of-the-art provisioning product. We also present initial results of a performance study. |
We have designed a maximum likelihood fitter using the actor model to distribute the computation over a heterogeneous network. The prototype implementation uses the SALSA programming language and the Internet Operating System middleware. We have used our fitter to perform a partial wave analysis of particle physics data. Preliminary measurements have shown good performance and scalability. We expect our approach to be applicable to other scientific domains, such as biology and astronomy, where maximum likelihood evaluation is an important technique. We also expect our performance results to scale to Internet-wide runtime infrastructures, given the high adaptability of our software framework. |
Automatic distributed garbage collection (GC) provides abstraction for grid application development, promoting code quality and improving resource management. Unreachability of active objects, or actors, from the root set is not a sufficient condition to collect actor garbage, making passive object GC algorithms unsafe when directly used on actor systems. In practical actor languages, all actors have references to the root set since they can interact with users, e.g., through standard input or output streams. Based on this observation, we introduce pseudo-roots: a dynamic set of actors that can be viewed as the root set. Pseudo-roots use protected (undeletable) references to ensure that no actors are erroneously collected even with messages in transit. Following this idea, we introduce a new direction of actor GC and demonstrate it by developing a distributed GC framework. The framework can thus be used for automatic lifetime management of mobile reactive processes with unordered asynchronous communication. |
Distributed actor garbage collection differs from distributed object garbage collection in that it needs to consider in-transit message detection, unordered message reception, and actor migration. In this paper, we propose a new snapshot-based distributed actor garbage collection algorithm. The algorithm requires neither First-In-First-Out nor blocking communication, nor message logging. Furthermore, actor migration is allowed while capturing global snapshots, and partial snapshots can be safely used to collect garbage, therefore not requiring comprehensive cooperation among all computing nodes. These features make it unique in the area of distributed garbage collection. We formally prove the following safety and conditional liveness properties of the algorithm: 1) the garbage in a global snapshot, created by composing several local snapshots, remains the same from the beginning to the end of the global snapshot algorithm, and 2) garbage is eventually collected if the global garbage collection algorithm is periodically activated and every blocked actor is always captured before a global garbage collection phase is triggered. |
Automatic distributed garbage collection (GC) provides abstraction for grid application development, promoting code quality and improving resource management. Unreachability of active objects, or actors, from the root set is not a sufficient condition to collect actor garbage, making passive object GC algorithms unsafe when directly used on actor systems. In practical actor languages, all actors have references to the root set since they can interact with users, e.g., through standard input or output streams. Based on this observation, we introduce pseudo-roots: a dynamic set of actors that can be viewed as the root set. Pseudo-roots use protected (undeletable) references to ensure that no actors are erroneously collected even with messages in transit. Following this idea, we introduce a new direction of actor GC and demonstrate it by developing a distributed GC framework. The framework can thus be used for automatic lifetime management of mobile reactive processes with unordered asynchronous communication. |
The role of distributed systems in our daily lives is becoming increasingly important. The resources of the cyber world are consumed by peers independently of their physical locations, and mobile codes interact with each other and with their environment on behalf of people. Grid computing provides potential benefits to applications, but as the responsibilities and the intelligence of such systems increase, so do the security threats they pose to the applications. In order to protect these distributed systems from attacks, they have to be equipped with secure components. In such systems, openness and dynamic reconfiguration are also inevitable for scalability. Furthermore, transparent information flow between the peers can help increase the efficacy for its participants and reduce users' cognitive load. In this thesis, we describe a reputation-based trust model, including a language and its operational semantics, in order to reason about security problems of open and distributed systems, and we prove its main security properties. Furthermore, we give some examples based on two common application fields of distributed computing: e-commerce and distributed file sharing. |
2005 |
Computational grids are appealing platforms for the execution of large scale applications among the scientific and engineering communities. However, designing new applications and deploying existing ones with the capability of exploiting this potential still remain a challenge. Computational grids are characterized by their dynamic, non-dedicated, and heterogeneous nature. Novel application-level and middleware-level techniques are needed to allow applications to reconfigure themselves and adapt automatically to their underlying execution environments. In this paper, we introduce a new software framework that enhances the performance of Message Passing Interface (MPI) applications through an adaptive middleware for load balancing that includes process checkpointing and migration. Fields as diverse as fluid dynamics, materials science, biomechanics, and ecology make use of parallel adaptive computation. Target architectures have traditionally been supercomputers and tightly coupled clusters. This framework is a first step in allowing these computations to use computational grids efficiently. |
We introduce transactors, a fault-tolerant programming model for composing loosely-coupled distributed components running in an unreliable environment such as the internet into systems that reliably maintain globally consistent distributed state. The transactor model incorporates certain elements of traditional transaction processing, but allows these elements to be composed in different ways without the need for central coordination, thus facilitating the study of distributed fault-tolerance from a semantic point of view. We formalize our approach via the τ-calculus, an extended lambda-calculus based on the actor model, and illustrate its usage through a number of examples. The τ-calculus incorporates constructs which distributed processes can use to create globally-consistent checkpoints. We provide an operational semantics for the τ-calculus, and formalize the following safety and liveness properties: first, we show that globally-consistent checkpoints have equivalent execution traces without any node failures or application-level failures, and second, we show that it is possible to reach globally-consistent checkpoints provided that there is some bounded failure-free interval during which checkpointing can occur. |
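A minimal Java sketch, with invented names, of the state discipline that the transactor model formalizes: a component's working state is volatile until it is checkpointed, and a failure rolls the component back to its last stable checkpoint. The actual τ-calculus additionally tracks dependencies among components so that checkpoints only become permanent when they are globally consistent; that machinery is omitted here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: volatile (working) state vs. checkpointed (stable) state.
final class TransactorState {
    private Map<String, Object> stable = new HashMap<>();   // last checkpoint
    private Map<String, Object> working = new HashMap<>();  // volatile state

    Object get(String key) { return working.get(key); }
    void set(String key, Object value) { working.put(key, value); }

    // "checkpoint": promote the volatile state to become the new stable state.
    void checkpoint() { stable = new HashMap<>(working); }

    // "rollback": discard volatile changes, e.g., after a node or application failure.
    void rollback() { working = new HashMap<>(stable); }
}
```

Roughly speaking, the full model only lets a checkpoint become permanent once the component no longer depends on any other component's volatile state, which is what yields the global consistency property formalized above.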
With the proliferation of large scale dynamic execution environments such as grids, the need for providing efficient and scalable application adaptation strategies for long running parallel and distributed applications has emerged. Message passing interfaces were initially designed with a traditional machine model in mind, one which assumes homogeneous and static environments. It is inevitable that long running message passing applications will require support for dynamic reconfiguration to maintain high performance under varying load conditions. In this paper we describe a framework that provides iterative MPI applications with reconfiguration capabilities. Our approach is based on integrating MPI applications with a middleware that supports process migration and large scale distributed application reconfiguration. We present our architecture for reconfiguring MPI applications, and verify our design with a heat diffusion application in a dynamic setting. |
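A schematic Java sketch (the interfaces are hypothetical stand-ins, not the framework's API) of the reconfiguration pattern iterative applications follow here: at each iteration boundary the process asks the middleware whether it should move, serializes its data partition if so, and resumes elsewhere.

```java
// Hypothetical interfaces standing in for the middleware and the application state.
interface Middleware {
    boolean migrationRequested();          // set asynchronously by the middleware agents
    void migrate(byte[] checkpoint);       // move this process and its state elsewhere
}

interface IterativeSolver {
    void iterate();                        // one step, e.g., one heat-diffusion sweep
    byte[] checkpoint();                   // serialize the local data partition
    boolean converged();
}

final class ReconfigurableLoop {
    static void run(IterativeSolver solver, Middleware middleware) {
        while (!solver.converged()) {
            solver.iterate();
            // Reconfiguration is considered only at iteration boundaries, where the
            // state to serialize is small and locally consistent.
            if (middleware.migrationRequested()) {
                middleware.migrate(solver.checkpoint());
            }
        }
    }
}
```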
Different coordination models exist for managing concurrency in complex distributed systems, for example, direct interaction, tuple spaces, hierarchical structures, and publish-and-subscribe mechanisms. Online auctions are complex distributed systems, where the choice of coordination model can significantly impact the performance of multiple software agents, acting on behalf of human buyers and sellers. In this paper, we evaluate analytically and experimentally different auction types and coordination models in an electronic marketplace. Three important metrics are: the number of software agents, the number of messages exchanged between these agents, and the number of migrations performed by the agents. We conclude that dynamically choosing the number of agents and the coordination model can improve quality of service in electronic commerce applications. Furthermore, in online auctions where there is a lot of interaction between software agents, migration of agents to virtual stores and local bidding is fairer and more efficient than using remote interaction. This observation has the potential to significantly improve and redefine existing online auction systems such as eBay. |
Data from the Sloan Digital Sky Survey have given evidence of structures within the Milky Way halo originating from other nearby galaxies. Both the halo and these structures are approximated by densities based on geometric objects. A model of the data is formed by a mixture of geometric densities. By using an EM-style algorithm, we optimize the parameters of our model in order to separate out these structures from the data and thus obtain an accurate dataset of the Milky Way. |
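For readers unfamiliar with EM-style fitting, the following Java sketch runs the same expectation-maximization loop on a deliberately simplified stand-in: a two-component, one-dimensional Gaussian mixture instead of the paper's geometric densities. The 1-D simplification and all parameter names are mine.

```java
// Simplified stand-in: EM for a two-component 1-D Gaussian mixture.
final class TwoComponentEM {
    static double gauss(double x, double mu, double var) {
        return Math.exp(-(x - mu) * (x - mu) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
    }

    static void fit(double[] x, int iters) {
        double w = 0.5, mu1 = -1, mu2 = 1, v1 = 1, v2 = 1;   // initial guesses
        int n = x.length;
        double[] r = new double[n];                           // responsibility of component 1
        for (int it = 0; it < iters; it++) {
            // E-step: posterior probability that each point came from component 1.
            for (int i = 0; i < n; i++) {
                double p1 = w * gauss(x[i], mu1, v1);
                double p2 = (1 - w) * gauss(x[i], mu2, v2);
                r[i] = p1 / (p1 + p2);
            }
            // M-step: re-estimate weight, means, and variances from the responsibilities.
            double s1 = 0, s2 = 0, m1 = 0, m2 = 0;
            for (int i = 0; i < n; i++) {
                s1 += r[i]; s2 += 1 - r[i];
                m1 += r[i] * x[i]; m2 += (1 - r[i]) * x[i];
            }
            mu1 = m1 / s1; mu2 = m2 / s2;
            double q1 = 0, q2 = 0;
            for (int i = 0; i < n; i++) {
                q1 += r[i] * (x[i] - mu1) * (x[i] - mu1);
                q2 += (1 - r[i]) * (x[i] - mu2) * (x[i] - mu2);
            }
            v1 = q1 / s1; v2 = q2 / s2; w = s1 / n;
        }
        System.out.printf("w=%.3f mu1=%.3f mu2=%.3f%n", w, mu1, mu2);
    }

    public static void main(String[] args) {
        java.util.Random rng = new java.util.Random(0);
        double[] x = new double[200];
        for (int i = 0; i < x.length; i++)
            x[i] = (i % 2 == 0) ? rng.nextGaussian() - 2 : rng.nextGaussian() + 2;
        fit(x, 50);
    }
}
```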
A naming service is in charge of locating a mobile agent in a distributed system given its name. Three critical characteristics of a naming service are: fault tolerance, scalability, and efficient name resolution. Most naming services provide support for efficient name resolution but do not address fault tolerance or scalability issues. Conversely, distributed hash-table approaches to naming provide fault tolerance and scalability albeit at a sacrificed name resolution performance. We introduce a Fault-Tolerant Home-Based Naming Service (FHNS) that is robust and scalable yet enabling efficient name resolution. |
Large-scale, dynamic, and heterogeneous networks of computational resources promise to provide high performance and scalability to computationally intensive applications, but these environments also introduce the need for complex resource management strategies. This paper introduces actor-based programming abstractions and a middleware framework to relieve developers from considering non-functional concerns while allowing middleware layers to optimize application performance and global resource utilization. The Internet Operating System (IOS) consists of a peer-to-peer virtual network of middleware agents that trigger application component reconfiguration based on changes in the underlying physical network and based on the application communication patterns and resource consumption. IOS middleware agents are highly customizable to account for different resource profiling, load balancing, and peer-to-peer interconnection policies. Despite the lack of global coordination and information management, IOS exhibited the ability to reconfigure distributed applications effectively, improving their performance over highly dynamic networks. Diverse application communication topologies were tested on Internet-like and Grid-like environments using two middleware agent interconnection topologies: peer-to-peer (p2p) and cluster-to-cluster (c2c). In most cases, p2p agent topologies outperformed c2c agent topologies for Internet-like environments, while c2c agent topologies outperformed p2p agent topologies for Grid-like environments. Our empirical results also show that using group migration of application components to perform load balancing outperforms single migration for the four studied application topologies. These empirical results suggest that adaptive middleware is needed to dynamically change the virtual network topology based on application-level communication patterns and network-level interconnectivity to improve distributed application performance. |
Modern parallel scientific computation is being performed in a wide variety of computational environments that include clusters, large-scale supercomputers, grid environments, and metacomputing environments. This presents challenges for application and library developers, who must develop architecture-aware software if they wish to utilize several computing platforms efficiently. Architecture-aware computation can be beneficial in single-processor environments. It takes the form of something as common as an optimizing compiler, which will optimize software for a target computer. Application and library developers may adjust data structures or memory management techniques to improve cache utilization on a particular system. |
We propose a process algebraic model of a distributed system with resource access and usage control. Our system has notions of explicitly located mobile agents, migration of agents between locations, location-transparent asynchronous directed communication (local and remote), local resources within locations and local resource access using synchronous communication. Thus, agents can communicate with local and remote agents but to access a resource, they have to migrate to the appropriate location and then establish a dedicated channel to communicate with the resource. We statically guarantee bounded usage of resources. The bound on the usage of a resource may be linearly distributed amongst other agents (including newly spawned agents). Our system borrows certain concepts from the π-calculus, the actor model and mobile ambients. The type system to enforce resource access and bounded resource usage is inspired by ideas for resource access control found in the Dπ calculus and linear type systems. |
Online computations are getting more and more embedded into the daily life of people; in particular, they have practical applications in electronic commerce. In the cyber world, mobile codes interact with each other and with their environment on behalf of sellers and buyers of the real world. Mobile code technologies provide potential benefits to applications, but as the responsibilities and the complexity of mobile codes increase, the variety of security threats imposed on the applications increases as well. In order to protect these inherently distributed systems from attacks, their components have to consider security aspects without giving up flexibility. Transparent information flow between the mobile codes and their environment can help increase efficiency for its participants and reduce users' cognitive load, yet it adds points of vulnerability to the system. In this thesis we describe a middleware framework for online auctions, in order to illustrate security problems in electronic commerce applications and suggest potential solutions. This middleware framework is a model of a futuristic electronic marketplace, consisting of various management components and electronic stores. In the electronic stores, the mobile codes, embedded in software agents representing sellers and buyers, come together to fulfill their duties. Furthermore, we describe a set of security requirements of the electronic marketplace. We then survey different models of mobile code in distributed computations, including actors, secure mobile actors, transactors, casts and directors, and mobile ambients. We then classify and compare these models according to their applicability to the security of the electronic marketplace scenario. We demonstrate how additional security instruments, such as proof-carrying code and cryptographic techniques, can be incorporated into these models in order to fulfill the security requirements of the marketplace electronic commerce applications. In summary, we investigated the conflicting design goals of openness and security in electronic commerce applications. This research is a first step in achieving the goal of building open yet secure electronic commerce systems. |
2004 |
Widely distributed applications using Internet resources for computing require complex resource naming, discovery, coordination, and management policies. Developers cope with this software complexity through middleware: software layers dealing with distribution issues such as naming, mobility, security, load balancing, and fault-tolerance. Middleware enables application developers to concentrate on their domain of expertise, reducing code and complexity by orders of magnitude. We discuss different aspects of middleware infrastructures and we present the WorldWide Computer, our own worldwide computing infrastructure with naming, mobility, and coordination middleware layers, facilitating Internet-based distributed systems development. |
Visualizing, testing and debugging distributed systems is a challenging task that is not well addressed by conventional software tools. OverView, an event-based Eclipse plug-in that provides runtime visualization of systems running on distributed Java virtual machines, is presented. In the same way that the coding and debugging tools in Eclipse make writing software more accessible by visually representing both a program's static components (packages, classes, and interfaces) and its dynamic components (objects, threads, and invocation stacks), OverView intends to make distributed systems more accessible to programmers by creating an analogous visual workspace with appropriate abstractions for distributed component naming, state, location, remote communication, and migration. OverView is a generic visualization framework that uses an Entity Specification Language (ESL) to enable developers to map high-level concurrency and distribution abstractions into lower-level Java threads, network connections and objects. |
The Internet is constantly growing as a ubiquitous platform for high-performance distributed computing. In this paper, we propose a new software framework for distributed computing over large scale dynamic and heterogeneous systems. Our framework wraps computation into autonomous actors, self-organizing computing entities which freely roam over the network to find their optimal target execution environments. We introduce the architecture of our worldwide computing framework, which consists of an actor-oriented programming language (SALSA), a distributed run time environment (WWC), and a middleware infrastructure for autonomous reconfiguration and load balancing (IO). Load balancing is completely transparent to application programmers. The middleware triggers actor migration based on profiling resources in a completely decentralized manner. Our infrastructure also allows for the dynamic addition and removal of nodes from the computation, while continuously balancing the load given the changing resources. To balance computational load, we introduce three variations of random work stealing: load-sensitive (RS), actor topology-sensitive (ARS), and network topology-sensitive (NRS) random stealing. We evaluated RS and ARS with several actor interconnection topologies in a local area network. While RS performed worse than static round-robin (RR) actor placement, ARS outperformed both RS and RR in the sparse connectivity and hypercube connectivity tests, by a full order of magnitude. |
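The stealing variants can be illustrated with a small Java sketch using hypothetical node and actor interfaces: the load-sensitive flavor (RS) steals a random actor from a random peer only when the load difference is large enough, while the actor topology-sensitive flavor (ARS) additionally prefers actors that communicate most with the thief's local actors. This shows only the decision logic, not SALSA or the IO middleware itself.

```java
import java.util.List;
import java.util.Random;

// Hypothetical node/actor abstractions for illustrating the stealing decisions.
interface Node {
    double load();                      // e.g., run-queue length or CPU utilization
    List<ActorRef> actors();
}
interface ActorRef {
    int messagesExchangedWith(Node n);  // communication affinity with a node
}

final class RandomStealing {
    private static final Random RNG = new Random();
    private static final double THRESHOLD = 0.25;   // assumed minimum expected benefit

    // RS: steal some actor from a random peer if we are sufficiently less loaded.
    static ActorRef loadSensitiveSteal(Node self, List<Node> peers) {
        Node victim = peers.get(RNG.nextInt(peers.size()));
        List<ActorRef> candidates = victim.actors();
        if (victim.load() - self.load() < THRESHOLD || candidates.isEmpty()) return null;
        return candidates.get(RNG.nextInt(candidates.size()));
    }

    // ARS: among the victim's actors, prefer the one with the most communication
    // with actors already placed on the thief, to reduce remote messaging.
    static ActorRef topologySensitiveSteal(Node self, List<Node> peers) {
        Node victim = peers.get(RNG.nextInt(peers.size()));
        List<ActorRef> candidates = victim.actors();
        if (victim.load() - self.load() < THRESHOLD || candidates.isEmpty()) return null;
        ActorRef best = null;
        int bestAffinity = -1;
        for (ActorRef a : candidates) {
            int affinity = a.messagesExchangedWith(self);
            if (affinity > bestAffinity) { bestAffinity = affinity; best = a; }
        }
        return best;
    }
}
```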
We introduce transactors, a fault-tolerant programming model for composing loosely-coupled distributed components running in an unreliable environment such as the internet into systems that reliably maintain globally consistent distributed state. The transactor model incorporates certain elements of traditional transaction processing, but allows these elements to be composed in different ways without the need for central coordination, thus facilitating the study of distributed fault-tolerance from a semantic point of view. We formalize our approach via the τ-calculus, an extended lambda-calculus based on the actor model, and illustrate its usage through a number of examples. The τ-calculus incorporates constructs which distributed processes can use to create globally-consistent checkpoints. We provide an operational semantics for the τ-calculus, and formalize the following safety and liveness properties: first, we show that globally-consistent checkpoints have equivalent execution traces without any node failures or application-level failures, and second, we show that it is possible to reach globally-consistent checkpoints provided that there is some bounded failure-free interval during which checkpointing can occur. |
Computational grids are characterized by their dynamic, non-dedicated, and heterogeneous nature. Novel application-level and middleware-level techniques are needed to allow applications to reconfigure themselves and adapt automatically to their underlying execution environments to be able to benefit from computational grids' resources. In this paper, we introduce a new software framework that enhances the Message Passing Interface (MPI) performance through process checkpointing, migration, and an adaptive middleware for load balancing. Fields as diverse as fluid dynamics, material science, biomechanics, and ecology make use of parallel adaptive computation where target architectures have traditionally been supercomputers and tightly coupled clusters. This framework is a first step in allowing these computations to use computational grids efficiently. Preliminary results demonstrate that application reconfiguration through middleware-triggered process migration achieved performance improvement in the range of 33% to 79%. |
With the increase in bandwidth and interconnection of computers, there has been an increasing adoption of the distributed computing paradigm to solve computationally intensive problems that would normally take a long time if solved on a single computer. However, the development of programming languages and libraries specifically designed for distributed programming tasks such as efficient work distribution, network setup, load balancing, and dynamic reconfiguration has been slow. This results in inefficient implementations and considerable time spent on developing distributed computing-specific parts rather than on actual application development. The issues identified earlier in distributed application development span programming languages, library support, and run-time systems. This thesis addresses these issues mainly at the library level, with some programming language level and runtime level support. This thesis identifies and develops abstractions to enable the design and development of generic components and libraries for distributed computations. It uses a Generic Software Design Methodology to design these abstractions and to enable maximum reuse, flexibility, and efficiency. A Generic Framework for Distributed Computing (GFDC), developed in this thesis, can be used in a broad range of distributed applications implemented using different technologies and programming languages. Java has emerged as a platform of choice for developing distributed applications. This thesis provides an implementation of GFDC in Java, as a set of generic libraries. For a class of problems which need reconfiguration and migration, researchers in the Worldwide Computing Laboratory at Rensselaer Polytechnic Institute are developing the SALSA programming language. The generic libraries that are developed in Java are transparently integrated with SALSA. GFDC and its implementation handle the distributed computing tasks and allow programmers to concentrate on the actual application development. This results in a flexible and efficient implementation along with reduced development time for distributed applications. Programmers who have little or no knowledge of distributed computing issues should be able to develop applications using GFDC. At the same time, GFDC and its implementation are extensible, enabling advanced programmers to customize and enhance it to tackle specific distributed computing issues. |
2003 |
We present the preliminary design of a programming model for building reliable systems with distributed state from collections of potentially unreliable components. Our transactor model provides constructs for maintaining consistency among the states of distributed components. Our intention is that transactors should support key aspects of both traditional distributed transactions, e.g., for electronic commerce, and systems with weaker consistency requirements, e.g., peer-to-peer file- and process-sharing systems. In this paper, we motivate the need for language support for maintenance of distributed state, describe the design goals for the transactor model, provide an operational semantics for a simple transactor calculus, and provide several examples of applications of the transactor model in a higher-level language. |
Over the last two decades, efficient message passing libraries have been developed for parallel scientific computation. Concurrently, programming languages have been created supporting dynamically reconfigurable distributed systems over the heterogeneous Internet. In this paper, we introduce SALSA-MPI, an actor programming language approach to scientific computing that extends MPI with a checkpointing and migration API and a runtime system that manages both periodic checkpoints and process or application migration. The goal is to enable dynamic network reconfiguration and load balancing without sacrificing application performance or requiring extensive code modifications. As driving technology for this effort of unifying parallel and distributed computing, we plan to use adaptive solvers of partial differential equations. Fields as diverse as fluid dynamics, material science, biomechanics, and ecology make use of parallel adaptive computation, but target architectures have traditionally been supercomputers and tightly-coupled clusters. SALSA-MPI is intended to allow these computations to make efficient use of more distributed and dynamic computing resources. |
Many scientific applications require computational capabilities not easily supported by current computing environments. We propose a scalable computing environment based on autonomous actors. In this approach, a wide range of computational resources, ranging from clusters to desktops and laptops, can run an application programmed using actors as program components in an actor language: SALSA. SALSA actors have the ability to execute autonomously in dynamically reconfigurable computing environments. We develop the corresponding “Internet Operating System” (IO) to address run-time middleware issues such as permanent storage for results produced by actors, inter-actor communication and synchronization, and fault-tolerance in a manner transparent to the end-user. We are using this worldwide computing software infrastructure to solve a long outstanding problem in particle physics: the missing baryons, originally identified over thirty years ago. |
Modern distributed computing requires a secure framework capable of free code mobility. In this paper, we present a simple lambda-based actor language with extensions for mobility and security, as well as the operational semantics to reason about these topics in distributed systems. Finally, we describe our preliminary implementation results. |
Online visualization enables developers to test, debug, and monitor the behavior of distributed systems, while they are running. While important in software development, online visualization of distributed systems is largely unaddressed by conventional tools. Distributed systems are often programmed using high-level abstractions that facilitate reasoning about them, e.g., actors, processes, sessions, or ambients. OverView is an entity specification language-driven Eclipse plug-in for visualization of distributed systems that preserves the high level of abstraction, and enables online visualization of critical distributed system properties such as component naming, location, remote communication, and migration. OverView’s architecture is generic in that different abstractions can reuse the visualization module requiring changes only in the entity specifications that drive the visualization process. |
As a mobile software agent migrates to various host machines on a network, collaborating agents need to be able to locate the mobile agent for communication. A naming service is in charge of locating such an agent in a distributed system given its name. Three critical characteristics of a naming service are: fault tolerance, scalability, and efficient name resolution. Most naming services provide support for efficient name resolution but do not address fault tolerance or scalability issues. Conversely, distributed hash-table approaches to naming provide fault tolerance and scalability albeit at a sacrificed name resolution performance. We introduce a Fault-Tolerant Home-Based Naming Service (FHNS) that is robust and scalable yet enabling efficient name resolution. An agent using FHNS has a unique name that encodes its home base. Home bases are connected in a peer-to-peer manner. The agent’s run-time system is responsible for keeping the agent’s home base informed of its network location. An agent obtains a reference to another agent by knowing the name of that agent and requesting any home base to resolve that agent’s location. Redundancy exists between neighboring home bases to ensure the location of any given agent can still be resolved despite an agent’s respective home base failing, thus eliminating single points of failure. In the event of a home base failure, the failover procedure occurs transparently to the agents. Furthermore, FHNS has optimal sequential fault tolerance behavior: if the home bases fail sequentially down to any single remaining home base, FHNS is still able to resolve the location of every agent in the distributed system. Resolving the location of an agent with a distributed hash-table approach, such as Chord or SPRR, requires O(log n) messages between home bases. In a worldwide computing system with potentially billions of nodes, such a lookup could still take up to 30 messages. Realizing that logarithmic lookups are not as efficient as needed for a naming service, FHNS provides name to location resolution in one round-trip request (two messages) in the general case. A logarithmic number of messages is only needed when a direct request to a home base fails. FHNS is a novel contribution being made to the field of mobile agent computing. Fault-tolerant home-based naming can be incorporated into existing mobile agent platforms to eliminate single points of failure without sacrificing efficient name resolution. |
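A schematic Java sketch of the resolution path described above, using invented types and an assumed name format of the form agentId@homeHost (the actual UAN/UAL encoding differs): a lookup normally costs one request to the home base encoded in the name, and falls back to the replicating neighbor only if that home base is unreachable.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical home-base abstraction; names are assumed to look like "agentId@homeHost".
interface HomeBase {
    Optional<String> locate(String agentName);   // agent's current network location, if known
    HomeBase successor();                        // neighboring home base holding replicas
    boolean isReachable();
}

final class HomeBasedNaming {
    private final Map<String, HomeBase> homeBases;   // homeHost -> home base stub
    HomeBasedNaming(Map<String, HomeBase> homeBases) { this.homeBases = homeBases; }

    // One round trip in the common case: ask the home base encoded in the name.
    // If that home base has failed, ask its replicating neighbor instead.
    Optional<String> resolve(String agentName) {
        String homeHost = agentName.substring(agentName.indexOf('@') + 1);
        HomeBase home = homeBases.get(homeHost);
        HomeBase target = home.isReachable() ? home : home.successor();
        return target.locate(agentName);
    }
}
```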
2002 |
We present the preliminary design of a programming model for building reliable systems with distributed state from collections of potentially unreliable components. Our transactor model provides constructs for maintaining consistency among the states of distributed components. Our intention is that transactors should support key aspects of both traditional distributed transactions, e.g., for electronic commerce, and systems with weaker consistency requirements, e.g., peer-to-peer file- and process-sharing systems. In this paper, we motivate the need for language support for maintenance of distributed state, describe the design goals for the transactor model, provide an operational semantics for a simple transactor calculus, and provide several examples of applications of the transactor model in a higher-level language. |
Worldwide computing seeks to enable the seamless large-scale utilization of networked heterogeneous resources as a single virtual global computer. Web services use standard internet protocols and markup languages to support the compositional development of portable distributed systems. In this paper, we consider the future of web services as a foundation for developing worldwide computing applications. In particular, we argue that future web service technologies need to provide high-level abstractions for portable dynamic code mobility and decentralized coordination. Code mobility and service coordination are two key critical missing components to add to XML-based solutions for service definition (WSDL), data interchange (SOAP), and service registration and discovery (UDDI). |
2001 |
The enormous growth of the World-Wide Web has created the opportunity to use the combined computing and communication resources of millions of computers and devices connected to the Internet. The goal of the World-Wide Computer (WWC) project is to effectively turn the Web into a unified, dependable, distributed computing infrastructure. The WWC harnesses under-utilized computing resources by providing application programmers with the potential to globally distribute computations. Furthermore, the WWC provides mobile users and remote collaborators with a unified interface to their data and programs. Several applications in multiple domains -- as diverse as massively parallel computing, remote collaboration, coordinated computing, and Internet agents -- motivate the WWC. To realize this vision, we develop mechanisms for naming, migration, and coordination of software components and applications running on top of the Web. We represent software components as collections of actors. Actors provide autonomy, simplicity of communication and computation, and a well-developed formal semantics. Therefore, the WWC project uses the actor model of concurrent computation as a basis for studying and implementing different strategies for distributed software component naming, migration, and coordination. Universal naming is a critical aspect of accomplishing worldwide computing. The WWC's naming strategy, based on Uniform Resource Identifiers (URIs), enables transparent migration and interconnection of actors. While a Universal Actor Name (UAN) persists over the lifetime of an actor, Universal Actor Locators (UALs) change according to the actor's current location, and prescribe a protocol for communication with the actor. Data and code migration enable scalability, more efficient network usage, improved graphical user interfaces, and mobile users and resources. Much work has been done on providing transparent access to remote objects, hiding their location information from application programmers, albeit with the unfortunate consequence of hindering efficient network programming. This thesis describes a model for both fine-grained and coarse-grained migration, which enables the development of applications with transparent actor mobility, yet allows programming control over the locality of the participating actors. Coordination of concurrent computations in the WWC is difficult. Traditional coordination mechanisms rely on a shared space, and therefore do not scale to the World-Wide Web. This thesis develops a scalable hierarchical model for coordination. Actors are grouped into casts which contain a special actor, designated as the cast director. Directors may themselves belong to other casts, creating a coordination forest. Message passing is constrained by the directors of a recipient. All directors of a target actor, up to the first common ancestor with the message sender, need to approve a message before it is delivered; directors above the first common ancestor are not involved. |
Applications running on the Internet, or on limited-resource devices, need to be able to adapt to changes in their execution environment at run-time. Current languages and systems fall short of enabling developers to migrate and reconfigure application sub-components at program-execution time. In this paper, we describe essential aspects of the design and implementation of SALSA, an actor-based language for mobile and Internet computing. SALSA simplifies programming dynamically reconfigurable, open applications by providing universal names, active objects, and migration. Moreover, SALSA introduces three language mechanisms to help programmers coordinate asynchronous, mobile computations: token-passing continuations, join continuations and first-class continuations. We provide some examples which illustrate how SALSA programs are not only dynamically reconfigurable and open, but also much more concise and easier to follow than comparable Java code. Furthermore, we provide empirical results which show SALSA's performance to be better than Java code using an actor library, and which illustrate the difference between local, local area, and wide area communication and migration. Finally, we discuss the implementation of our preprocessor which translates SALSA code into Java. |
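SALSA's coordination constructs have no direct Java counterpart, but CompletableFuture gives a rough analogue of the first two: a token-passing continuation chains asynchronous sends so that each stage receives the previous result, and a join continuation fires once a whole group of asynchronous sends has completed. This is an analogy only; SALSA's own syntax and semantics are not shown here.

```java
import java.util.concurrent.CompletableFuture;

final class ContinuationAnalogues {
    public static void main(String[] args) {
        // Token-passing continuation analogue: each stage receives the previous token.
        CompletableFuture<Integer> token =
            CompletableFuture.supplyAsync(() -> 21)          // first asynchronous "message"
                             .thenApplyAsync(t -> t * 2);    // next stage consumes the token

        // Join continuation analogue: continue only after all members of the group reply.
        CompletableFuture<Integer> a = CompletableFuture.supplyAsync(() -> 1);
        CompletableFuture<Integer> b = CompletableFuture.supplyAsync(() -> 2);
        CompletableFuture<Void> joined =
            CompletableFuture.allOf(a, b)
                             .thenRun(() -> System.out.println("sum = " + (a.join() + b.join())));

        System.out.println("token = " + token.join());
        joined.join();
    }
}
```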
Flexible and efficient naming, migration and coordination schemes are critical components of concurrent and distributed systems. This chapter describes actor naming and coordination models and infrastructures, which enable the development of mobile agent systems. A travel agent example is used to motivate the requirements and proposed solutions for naming, migration and coordination. Universal Actor Names provide location and migration transparency, while ActorSpaces enable the unanticipated connection of users, agents and services in the open, dynamic nature of today’s networks. An actor-based architecture, the World Wide Computer, is presented as a basis for implementing higher-level naming and coordination models for Internet-based agent systems. Finally, multiagent coordination is accomplished with cyborgs, an abstraction which provides a unit for group migration and resource consumption through the use of e-cash. |
1999 |
We describe a hierarchical model for coordination of concurrent activities based on grouping actors into casts and coordinating casts by actors that are designated directors. The hierarchical model provides a simple, intuitive basis for actor communication and coordination. Casts serve as abstraction units for naming, migration, synchronization and load balancing. Messengers are actors used to send messages with special behaviour across casts. Moreover, an implementation of the hierarchical model does not require a reflective run-time architecture. We present the operational semantics for our model and illustrate the model by two sample applications: an atomic multicast protocol and a messenger carrying remote exception-handling code. These applications have been implemented in Java, leveraging the existence of cross-platform, safe virtual machine implementations. |
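A minimal Java sketch (hypothetical classes, with the boundary choice of excluding the common ancestor itself) of the delivery rule in the hierarchical model: a message to an actor is delivered only if every director on the path from the recipient up to, but not including, the first director that is also an ancestor of the sender approves it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cast-hierarchy node: an actor or cast member with an optional director above it.
class Member {
    final Member director;          // null for a top-level actor
    Member(Member director) { this.director = director; }
    boolean approves(Object message) { return true; }   // placeholder approval policy
}

final class HierarchicalDelivery {
    // Collect the recipient's directors up to (here, excluding) the first common ancestor
    // with the sender, and deliver only if all of them approve the message.
    static boolean deliver(Member sender, Member recipient, Object message) {
        List<Member> senderChain = new ArrayList<>();
        for (Member d = sender; d != null; d = d.director) senderChain.add(d);
        for (Member d = recipient.director; d != null && !senderChain.contains(d); d = d.director) {
            if (!d.approves(message)) return false;      // some director vetoed delivery
        }
        return true;                                     // message may be delivered
    }
}
```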
1998 |
In this paper, we discuss some drawbacks of the Java programming language, and propose some potential improvements for concurrent object-oriented software development. In particular, we argue that Java's passive object model does not provide an effective means for building distributed applications, critical for the future of Web-based next-generation information systems. Specifically, we suggest improvements to Java's existing mechanisms for maintaining consistency across multiple threads (e.g. synchronized), sending asynchronous messages (e.g. start/run methods) and controlling resources (e.g. thread scheduling). We drive the discussion with examples and suggestions from our own work on the Actor model of computation. |
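To make the contrast with Java's passive object model concrete, here is a minimal actor-style mailbox layered over a Java thread, written as an illustration of the argument rather than as the authors' Actor library: send is asynchronous and returns immediately, and the actor processes one message at a time, so its own state needs no synchronized blocks.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Minimal actor: a mailbox plus a single thread that processes one message at a time.
final class MiniActor<M> {
    private final BlockingQueue<M> mailbox = new LinkedBlockingQueue<>();

    MiniActor(Consumer<M> behavior) {
        Thread worker = new Thread(() -> {
            try {
                while (true) behavior.accept(mailbox.take());   // strictly sequential processing
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();              // allow orderly shutdown
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Asynchronous send: returns immediately, unlike a synchronized method call.
    void send(M message) { mailbox.add(message); }
}
```

For example, `new MiniActor<String>(System.out::println).send("hello")` queues the message and returns without blocking the sender.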
Java supports heterogeneous applications by transforming a heterogeneous network of machines into a homogeneous network of Java virtual machines. This approach abstracts over many of the complications that arise from heterogeneity, providing a uniform API to all components of an application. However, for many applications heterogeneity is an intentional feature where components and resources are co-located for optimal performance. The authors argue that Java's API does not provide an effective means for building applications in such an environment. Specifically, they suggest improvements to Java's existing mechanisms for maintaining consistency (e.g. synchronized), and controlling resources (e.g. thread scheduling). They also consider the recent addition of a CORBA API in JDK 1.2. They argue that while such an approach provides greater flexibility for heterogeneous applications, many key problems still exist from an architectural standpoint. Finally, they consider the future of Java as a foundation for component-based software in heterogeneous environments and suggest architectural abstractions which will prove key to the successful development of such systems. They drive the discussion with examples and suggestions from their work on the Actor model of computation. |
We define several actor-based abstractions (casts, directors, messengers) to effectively harness the power of the World Wide Web as a global computing infrastructure. Groups of actors, or "casts", represent an abstraction unit for naming, synchronization, migration and load balancing. Each cast contains a "director" and inter-cast communication is performed via special actors named "messengers". |
1997 |
In a worldwide distributed system with objects that offer an RPC service interface, the SSP chains technique allows objects to migrate across the network transparently to their clients. However, this technique represents object references by means of SSP chains extended over an arbitrarily long sequence of computers, which has two drawbacks: first, the availability of the referenced objects may depend on numerous sites not failing; second, distributed garbage collection is complex. Regarding failure recovery, the SSP chain technique proposes searching the system exhaustively, which may not be feasible in a system with innumerable sites. In this article we propose changes to this technique so that SSP chains do not grow longer and global searches are avoided. |
1996 |
1995 |
In this paper, we present some critical issues involving the browsing of object-oriented databases over the World-Wide Web, as well as performance results using our Database Browser (DB) implementation. First, given the statelessness of HTTP, we introduced a dispatcher script architecture. In this architecture, a CGI script communicates with an intermediate data server connected to the database application, which keeps the database open for faster future transactions. Second, we used ODL, a standard object definition language, and we defined a simple intermediate data format for moving database objects across the network. Lastly, we defined five generic hypertext interfaces for: a database schema, a class definition, a class query form, a class extent, and a set of class instances. We implemented these ideas using ObjectStore with two database applications, one containing university registration information and another containing electronic mailboxes. An important result was the dramatic performance improvement gained by introducing our Database Browser (DB) architecture. |
The concept of an α-shape of a finite set of points with weights in R^d is defined and illustrated. It is a polytope uniquely determined by the points, their weights, and a parameter α that controls the desired level of detail. Software that computes such shapes in dimensions 2 and 3 is available via anonymous ftp. |
The success of a distributed information system lies heavily on the simplicity of generating, providing, using, and referring to information. The World Wide Web is composed of excellent protocols, tools, and languages to perform these actions for static information, which has a fixed format, such as home pages. An extension to this technology, called Zelig, has been designed to facilitate the construction of interfaces that provide access to dynamic information, which has variable formats, such as the results of queries to existing databases. The goal of Zelig is to make it easy for people to create interfaces to dynamic information sources without large amounts of programming effort. Zelig generates CGI scripts, which are programs that serve as mediators between Web servers and database management systems. Its code generation is based on user interface abstractions, or schemata. These schemata are written in ZHTML, a high-level language extension to HTML that incorporates several directives for dynamic document generation. ZHTML allows Web database designers to do very fast prototyping for database interfaces. Zelig's architecture has been implemented in a system which provides access to nearly CCSO phone nameservers around the world. This gateway currently receives around 2000 queries per day. The main contributions of this research are, first, a high-level language for providing dynamic information on the Web, which allows very fast prototyping by eliminating the need for extensive CGI programming; and second, the separation of database-specific modules and presentation-specific modules on the World Wide Web server side, which allows more flexible user interfaces to databases. It is hoped that in the future Zelig will enhance the richness and utility of information access available on the World Wide Web. |
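As a toy illustration of the schema idea (the directive syntax below is invented, not actual ZHTML), an HTML template annotated with comment directives can be filled from a query result, so interface designers edit the template rather than the CGI program:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy schema-driven generation: replace <!--FIELD:name--> directives with query results.
final class SchemaFiller {
    private static final Pattern DIRECTIVE = Pattern.compile("<!--FIELD:(\\w+)-->");

    static String fill(String schema, Map<String, String> record) {
        Matcher m = DIRECTIVE.matcher(schema);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Substitute each directive with the corresponding field of the record.
            m.appendReplacement(out, Matcher.quoteReplacement(record.getOrDefault(m.group(1), "")));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String schema = "<li><b><!--FIELD:name--></b> &mdash; <!--FIELD:email--></li>";
        System.out.println(fill(schema, Map.of("name", "Ada Lovelace", "email", "ada@example.edu")));
    }
}
```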
1994 |
The World-Wide Web provides access to a global information universe using available technology [Berners-Lee et al. 1992]. In order to fully realize the benefits of this information system, we are developing a system, Zelig, to provide on-the-fly access to databases and dynamic information through effective user interfaces [Varela and Hayes 1994]. In this paper, we have extended Zelig to generate code for performing conversions from fixed data formats into hypertext. Consequently, information providers only need to give examples of their current database reports and the desired hypertext to be generated for those particular examples. Zelig produces the program to extract relevant data from the reports and the schemata to drive the hypertext generation process. We include as an example, an interface to ph/qi, the CCSO nameserver software providing data for academic institutions around the world. |
The World-Wide Web brings a global information universe into existence using available technology [Berners-Lee et al. 1992]. In order to fully realize the benefits of this information system, methodologies need to be developed for the creation of scripts that query existing databases and produce effective user interfaces. Present practice falls short of this goal in two areas: first, interface changes require direct modification of the scripts, and second, user interfaces are hard, in the sense that they don't adapt to database usage. We present Zelig, a schema-based approach to HTML document generation that addresses both these problems. First, Zelig uses ZHTML schemata, which are HTML documents commented with directives for document generation. And second, Zelig contains an expert module which gives advice regarding the underlying data structures and interface design issues. This approach allows soft or evolving database applications that keep track of usage and self-adapt to increase database efficiency and to improve human-computer interaction. As an example, we have used this approach to automatically generate four different WWW interfaces to the CCSO phone nameserver database software. |
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
The documents contained in these directories are made available by the contributing authors to ensure the timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are retained by the authors and by the copyright holders, notwithstanding that they have presented their works here in electronic form. Persons copying this information must adhere to the terms and constraints covered by each author's copyright. These works may not be made available elsewhere without the explicit permission of the copyright holder.
This document was translated from BibTeX by bibtex2html