Task Progress:
A Virtual Assistant (VA) tailored to the task at hand, called Daphne-AT (for Anomaly Treatment), has been developed. Daphne-AT has all the major capabilities required to run the experiments in the research plan (e.g., question answering, querying a database, detecting and diagnosing anomalies, suggesting actions based on the current situation). Daphne-AT follows a micro-services architecture to facilitate the addition of new skills and services to the system. It consists of a web-based front-end interface; a front-end server (the Daphne brain), which accepts user requests in natural language and directs them to the appropriate skills; a set of skills (Detection, Diagnosis, Recommendation); a set of back-end services performing both statistical and logical reasoning; and a set of data sources (a real-time spacecraft telemetry feed, an expert knowledge base, and a historical database). The user can select all or a subset of the anomalous parameters (called “symptoms”) and ask Daphne-AT for a diagnosis.
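The routing role of the Daphne brain described above can be sketched as follows. The skill names mirror the report; the dispatch logic, function names, and request format are illustrative assumptions, not the actual Daphne-AT implementation.

```python
# Minimal sketch of the micro-services routing idea: the "brain" maps an
# inferred intent to the skill that serves it. All names are hypothetical.

def detection_skill(request):
    return f"detection handled: {request['payload']}"

def diagnosis_skill(request):
    return f"diagnosis handled: {request['payload']}"

def recommendation_skill(request):
    return f"recommendation handled: {request['payload']}"

SKILLS = {
    "detect": detection_skill,
    "diagnose": diagnosis_skill,
    "recommend": recommendation_skill,
}

def brain(intent, payload):
    """Route a user request to the appropriate skill."""
    skill = SKILLS.get(intent)
    if skill is None:
        return "Sorry, I do not have a skill for that request."
    return skill({"payload": payload})
```

A design like this keeps each skill independent, so new capabilities can be added by registering another entry in the dispatch table.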
The VA continuously reads sensor data from a simulated “telemetry feed” produced by the Habitat System Simulator (HSS) software developed by NASA Johnson Space Center. Daphne-AT provides a list of possible anomalies for the selected symptoms, sorted by a qualitative likelihood score (e.g., Very Likely, Somewhat Likely). To provide these recommendations, Daphne-AT has a knowledge graph containing information about all the habitat subsystems, the anomalies that can affect them, their symptoms, and the procedures for resolving them. The user can then select the anomaly they consider most likely among those identified by Daphne-AT, and Daphne-AT will suggest a procedure to resolve it. The user can follow along with the procedure, checking off steps as they are completed. The steps of the procedures also contain hyperlinks to figures of the relevant components or subsystems.
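The symptom-to-anomaly lookup described above can be illustrated with a toy example. The anomaly names, symptoms, scoring rule, and qualitative thresholds below are invented for illustration; the actual knowledge graph and likelihood model are more elaborate.

```python
# Toy knowledge graph: each anomaly maps to the set of symptoms it produces.
KNOWLEDGE_GRAPH = {
    "CDRA failure": {"high ppCO2", "CDRA fan off"},
    "Cabin leak": {"low cabin pressure", "high dP/dt"},
}

def qualitative(score):
    """Map a numeric match score to a qualitative likelihood label."""
    if score >= 0.75:
        return "Very Likely"
    if score >= 0.5:
        return "Somewhat Likely"
    return "Unlikely"

def diagnose(selected_symptoms):
    """Rank candidate anomalies by the fraction of their symptoms observed."""
    results = []
    for anomaly, symptoms in KNOWLEDGE_GRAPH.items():
        score = len(symptoms & selected_symptoms) / len(symptoms)
        results.append((anomaly, qualitative(score), score))
    results.sort(key=lambda r: r[2], reverse=True)
    return [(anomaly, label) for anomaly, label, _ in results]
```

For example, selecting both CDRA-related symptoms would rank "CDRA failure" first with a "Very Likely" label under this toy scoring rule.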
At any point during the task, the user can chat with the agent via a natural language interface. The baseline version of the VA used a template-based, restricted-domain question answering system with language models developed in-house. Using what was a state-of-the-art approach before the arrival of large language models, the question answering system classified the user question into one of N known question types using a small neural network. The agent then extracted the parameters expected for that question type from the user question and produced a query to the corresponding back-end service (e.g., a query to the knowledge graph). Finally, it inserted the answer into a predefined answer template for that question type and returned it to the user. This approach was extremely fast, consistent, and reliable for known question types, but it had two major drawbacks: it struggled with parameter extraction in the presence of typos or slightly different formulations, and it could only handle the known question types, so it was not very scalable.
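The classify–extract–query–fill pipeline described above can be sketched as follows. A keyword rule stands in for the small neural classifier, and the telemetry dictionary stands in for a back-end service; all names and values are hypothetical.

```python
import re

# Stand-in for a back-end telemetry service (values are invented).
TELEMETRY = {"ppO2": 160.0, "cabin temperature": 22.5}

def classify(question):
    """Stand-in for the neural question-type classifier."""
    if re.search(r"current value", question, re.IGNORECASE):
        return "GET_TELEMETRY_VALUE"
    return "UNKNOWN"

def extract_parameter(question):
    """Stand-in for template-based parameter extraction."""
    for name in TELEMETRY:
        if name.lower() in question.lower():
            return name
    return None

def answer(question):
    qtype = classify(question)
    if qtype != "GET_TELEMETRY_VALUE":
        return "Sorry, I don't know how to answer that."
    param = extract_parameter(question)
    if param is None:
        return "Sorry, I could not find that parameter."
    value = TELEMETRY[param]                              # back-end query
    return f"The current value of {param} is {value}."    # answer template
```

The drawbacks noted above are visible even in this sketch: a typo in the parameter name defeats `extract_parameter`, and any question type outside the classifier's known set falls through to the fallback response.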
An Enhanced version of the VA was developed and used for Lab Experiment 2 and HERA Campaign 7. The main change with respect to the Baseline version is the incorporation of additional information into the diagnosis, serving as explanations of the likelihood scores. In addition to the qualitative likelihood scores shown in the Baseline version, the Enhanced version shows a confidence level for the diagnosis. Beyond the changes in the diagnosis service, the Enhanced VA also includes a significantly improved chat agent that leverages GPT-4 and Retrieval-Augmented Generation (RAG) to provide much more flexible and scalable question answering.
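The RAG approach mentioned above can be sketched in a minimal form: retrieve the knowledge-base passages most relevant to the question, then prepend them to the prompt sent to the language model. A bag-of-words overlap score stands in for the embedding similarity a production system would use, the final LLM call is omitted, and the documents are invented examples.

```python
# Toy document store (contents are invented examples).
DOCUMENTS = [
    "The CDRA removes CO2 from the cabin atmosphere.",
    "SART measures situational awareness across several subscales.",
    "High ppCO2 can indicate a CDRA fault or a crew exercise period.",
]

def score(question, doc):
    """Crude relevance score: count of shared lowercase tokens."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

def retrieve(question, k=2):
    """Return the k documents most similar to the question."""
    return sorted(DOCUMENTS, key=lambda d: score(question, d), reverse=True)[:k]

def build_prompt(question):
    """Assemble the augmented prompt that would be sent to the LLM."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Because the retrieval step grounds the model's answer in the habitat knowledge base, this design scales to new question types without retraining a classifier, which is the scalability gain noted above.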
The specific research aims of this project were as follows:
1. Specific Aim 1 – Baseline VA: To measure the impact of the baseline VA (with question answering, but without the ability to engage in dialogues or take initiative) and its various skills (anomaly detection, diagnosis, resolution) on performance, cognitive workload (CW), situational awareness (SA), and trust in a laboratory environment.
2. Specific Aim 2 – Enhanced VA: To measure the impact of advanced capabilities of the VA, in terms of providing explanations, on performance, CW, SA, and trust in a laboratory environment.
3. Specific Aim 3 – Analog: To deploy and validate the system in an analog environment.
Two lab experiments were conducted to support Specific Aims 1 and 2 described above. In Experiment 1, we studied and characterized the impact of the baseline VA (without the ability to provide advanced explanations) on human performance, CW, SA, and trust. We conducted a counterbalanced, within-subjects experiment in which subjects were asked to resolve multiple anomalies representative of a flight scenario, with and without the help of our VA. Results showed a significant positive impact of the VA on human performance, workload, and situational awareness. The results were published in a paper in the Journal of Aerospace Information Systems.
In Experiment 2, we studied the impact of giving the VA the ability to provide explanations for its actions under various levels of scenario uncertainty and VA accuracy. Subjects were divided into three groups of different VA accuracy (high, medium, low). Each subject resolved two sets of 8 anomalies, one with explanations and one without. Similarly, half of the anomalies were of high uncertainty and the other half of low uncertainty. All independent variables were randomized and counterbalanced appropriately. We measured performance (number of anomalies correctly resolved, time to diagnosis), workload (TLX, Bedford), situational awareness (SART), and trust in automation. Results showed that the explanations had a significant positive effect on performance, situational awareness, and trust without affecting workload. The results were published in two papers in the Journal of Cognitive Engineering and Decision Making, one focusing on the effect of the explanations overall, and the other focusing on uncertainty.
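The counterbalancing of the within-subject factor (explanations on vs. off) can be illustrated with a simple alternating scheme: consecutive subjects receive the two block orders in turn so that order effects average out. The assignment scheme below is an assumption for illustration; the actual experiment used its own randomization procedure.

```python
import itertools

# The two possible orders of the within-subject explanation blocks.
ORDERS = [
    ("explanations", "no explanations"),
    ("no explanations", "explanations"),
]

def assign_orders(n_subjects):
    """Alternate the two block orders across consecutive subjects."""
    cycle = itertools.cycle(ORDERS)
    return [next(cycle) for _ in range(n_subjects)]
```

With an even number of subjects, each order is used equally often, so any learning or fatigue effect is balanced across the explanation conditions.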
A third lab experiment had been proposed in the original proposal as part of Specific Aim 2, but it was decided after seeing the results of the first experiments not to pursue it. This third experiment was intended to focus on the natural language interface, specifically on giving the VA the ability to take initiative in the dialogue. However, we observed in both lab and analog experiments that the users did not use the question answering system at all to perform the tasks. Therefore, we decided to postpone this study until we better understand what kinds of anomaly scenarios prompt users to rely more on the question answering system.
In addition to the lab experiments, two analog experiments were performed as part of HERA Campaigns 6 and 7 to achieve Specific Aim 3. In both, a similar experimental design to that of Lab Experiment 1 was used, where the crewmembers solved some anomaly scenarios with and without the VA, and we compared the outcomes in terms of performance, SA, and CW.
The baseline VA was characterized in HERA Campaign 6. Unlike in Lab Experiment 1, we found no significant effects of the VA on any metrics, except that attentional demand was significantly higher with the VA than without it. Crewmembers mentioned that they found the anomaly scenarios easy to solve with the emergency chart (the aid provided in the no-VA scenarios), so the VA was not really needed. Overall, we found lower workload, higher performance, and higher SA in C6 compared to Lab Experiment 1.
In HERA Campaign 7, the enhanced VA was used. Again, no significant effects of the VA on any metrics were found. Comparing the results with those of HERA Campaign 6, we observed higher Understanding (a SART subscale) in C7 than in C6, but also higher attentional demand (another SART subscale) and mental demand (a TLX subscale). These results are partially consistent with those of Lab Experiment 2, since they suggest an improvement in SA due to the explanations. However, in the analog case, this improvement comes at the cost of an increase in mental demand, which was not observed in the lab.
Overall, these four studies show potential for the use of VAs to support astronauts with anomaly resolution in long-duration exploration missions (LDEM), but they also show that the gains are strongly dependent on context, particularly for diagnosing anomalies. For simpler, known anomalies, a simple emergency chart like the ones used today is a sufficient diagnostic aid. The VA is likely to be most useful for diagnosing more complex, unknown anomalies.