Building a research-grade assistant

Building Unali

Authors: Gauthier Salavert, Chiara Semenzin (PhD)

Around half of the US population is in poor health (1). For 70% of these individuals, conventional medicine is falling short due to inadequate outcomes, high costs, or inaccessibility (2). Consequently, the past decade has witnessed a surge in public interest in complementary types of medicine (3). However, adoption by health providers in the form of integrative care (combining conventional with complementary medicine) remains limited. It is greatly impeded by providers’ unfamiliarity with the related scientific research, potential interactions and side effects (4). Our mission is to streamline integrative medicine. We work with top academic centers and apply state-of-the-art AI tools to build the first open source dataset for complementary medicine. Our dataset is built at the intersection of scientifically validated evidence and a repository of longstanding collective wisdom. It can be accessed via Unali Health, a user-friendly web app that offers a symptom-first approach to researching various complementary treatments, or via a dedicated API, which is currently under development.


Holistic Medicine; Integrative Medicine; Complementary and Alternative Medicine; Safe AI; Large Language Models; Retrieval Augmented Generation; Chain of Thought Prompting; Human-In-The-Loop.



In 2020, the share of Americans living with a chronic disease had grown to 50% (5). It is an epidemic for which our society has yet to find a solution.

In 2023, OpenAI released GPT-4, the latest version of its Large Language Model (LLM). In principle, LLMs hold the promise of dramatically scaling good reasoning and thus assisting users in discovering new solutions to old problems (6).


We decided to put GPT-4 to the test by giving it a list of 1,000 common symptoms and asking it to provide 5 effective natural treatments for each, when available. Since medical decision making is probabilistic (7), process-based reasoning almost always beats outcome-based reasoning over time (8). Thus, we also prompted GPT-4 to provide supporting scientific evidence for each of its answers so that we could evaluate its reasoning process. We reviewed random batches of results and found that GPT-4 failed at its task 70% of the time. It is reasonable to assume that GPT-4's flawed reasoning process will yield bad outcomes over time at a rate of a similar order of magnitude.


This shortcoming is in large part due to (i) the nature of an LLM, a text-completion engine based on massive-scale training data (9), and (ii) the current paradigm in machine learning, where training data is all about scale rather than quality (10). GPT-4 is an extremely powerful autocomplete system whose reasoning has not been proven to follow logic, trained on a dataset too vast to ensure the accuracy critical to the medical field.


There are two solutions to address these shortcomings: fine-tuning and Retrieval Augmented Generation (RAG). Fine-tuning substantially improves our control over an LLM’s behavior and writing style. RAG gives us control over the dataset accessed by the LLM (11). While RAG’s anchoring of answers to specific documents makes it the ideal tool for our task, it requires a highly specialized knowledge dataset to work optimally. Existing datasets on complementary medicine are fragmented and often locked behind paywalls. They lack robust scientific grounding and frequently cite no peer-reviewed literature. While some products employ Natural Language Processing (NLP) tools such as LLMs to provide scientific literature related to complementary treatments, these solutions tend to be “black boxes”: their parameters are not easily interpretable, and they are not designed for curation by human experts in a human-in-the-loop framework. Neither are they built for widespread sharing with a potentially unlimited community of users. Thus, we decided to build our own dataset. It would be the first dataset dedicated to complementary medicine that is both open source and AI-ready, so that third parties can take advantage of it.
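To make the idea of anchoring answers to specific documents concrete, here is a minimal sketch of how a RAG prompt could be assembled once passages have been retrieved. The function name, the source ids and the instruction wording are illustrative assumptions, not a description of our production prompts.

```python
def build_rag_prompt(question, passages):
    """Assemble a prompt that restricts the model to the retrieved passages.

    `passages` is assumed to be a list of (source_id, text) tuples that a
    retrieval step has already selected; both values are plain strings.
    """
    context = "\n".join(f"[{source_id}] {text}" for source_id, text in passages)
    return (
        "Answer using only the sources below and cite them by their bracketed id. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder passages (hypothetical content, for illustration only).
rag_prompt = build_rag_prompt(
    "What evidence exists for treatment A?",
    [("S1", "Placeholder passage about treatment A."),
     ("S2", "Placeholder passage about treatment B.")],
)
print(rag_prompt)
```

The resulting string would then be sent to whichever LLM is in use; because every passage carries an id, a reviewer can trace each cited claim back to its source.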


          1. Scientific Research

An obvious starting point for creating a medical dataset is publicly available scientific research. Every day, thousands of scientific papers and preprints are published. The amount of research has grown by 50% in the last seven years (12). This, however, represents both an opportunity and a challenge for our endeavor: while more data is generally seen as desirable in any Machine Learning task, integrating all available research into our dataset is a computation-heavy task. At the same time, such a dataset would be tainted by the growing amount of fraudulent data marring modern research (13). To remedy this, we have to sort through what is available.

                    1.1 Hybrid search

To narrow down scientific research, we use the recommendation API built by Semantic Scholar at the Allen Institute for AI. The recommendation API relies on contextual and semantic information within scientific papers to recommend relevant and recently published papers (14). Thanks to it we were able to reduce our pool from 200 million to 100 thousand research papers relevant to complementary medicine.

The recommendation API from Semantic Scholar is a powerful tool for capturing semantic relations between keywords that go beyond literal input. However, the Transformer it relies on to generate embeddings is not optimized on a dataset of complementary medicine texts, and thus tends to underperform in this domain (15). To continue cleaning our data, we made our search hybrid by adding keyword filtering to Semantic Scholar's semantic approach. After this step, we had narrowed down our dataset to 20,000 relevant papers.
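A minimal sketch of the keyword-filtering half of such a hybrid search is shown below. It assumes the semantic step has already returned candidate papers as records with a title and an abstract; the keyword list, the field names and the `min_hits` threshold are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical keyword list; the actual vocabulary is not specified here.
KEYWORDS = {"acupuncture", "herbal", "ayurveda", "meditation", "probiotic"}


def keyword_hits(text, keywords=KEYWORDS):
    """Count how many distinct keywords occur as whole words in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & keywords)


def hybrid_filter(papers, min_hits=1):
    """Keep semantically recommended papers that also match enough keywords.

    `papers` is assumed to be a list of dicts with "title" and "abstract"
    keys (the abstract may be None).
    """
    kept = []
    for paper in papers:
        text = f"{paper.get('title') or ''} {paper.get('abstract') or ''}"
        if keyword_hits(text) >= min_hits:
            kept.append(paper)
    return kept


# Made-up candidate records standing in for the output of the semantic step.
candidates = [
    {"title": "Acupuncture for chronic low back pain", "abstract": "A randomized trial."},
    {"title": "Graph neural networks for traffic forecasting", "abstract": None},
    {"title": "Herbal medicine and sleep quality", "abstract": "Meditation was a co-intervention."},
]
filtered = hybrid_filter(candidates)
print([paper["title"] for paper in filtered])
```

Raising `min_hits` trades recall for precision: a stricter threshold drops more off-topic papers but also more relevant ones that use unusual vocabulary.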


To evaluate the relevance of the dataset produced by our hybrid search, we manually inspected random samples of 100 papers: the share of relevant papers had improved from 30% to 50%.
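Because a sample of 100 papers only approximates the relevance of the full dataset, it can help to attach an uncertainty range to such a figure. The sketch below computes a Wilson score interval for a sample proportion; the choice of this interval and the 95% level (`z=1.96`) are our assumptions for illustration, not part of the evaluation described above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# 50 relevant papers found in a random sample of 100.
low, high = wilson_interval(50, 100)
print(f"estimated relevance between {low:.1%} and {high:.1%}")
```

Larger samples narrow the interval, which is one way to decide how many papers need manual inspection before two pipeline versions can be compared with confidence.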


                   1.2 Human in the loop

To further enhance the relevance of our dataset, we focused our efforts on evaluating study design. We applied an automated version of standardized measures of scientific quality, specifically the Jadad scale (16) and the Food and Drug Administration (FDA) guidelines (17), to our dataset. The FDA guidelines ensure that our investigations consider adequate controls, sample sizes, and endpoints, along with as many potential biases and confounding factors as possible. The Jadad scale lets us specifically assess the methodological quality of clinical trials.
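As an illustration of what automating the Jadad scale involves, the sketch below scores a trial from the answers to the scale's standard questions (randomization, blinding, and withdrawals), giving a total between 0 and 5. It assumes those answers have already been extracted from the paper, for example by an LLM or a human reviewer; the `TrialReport` structure and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrialReport:
    """Answers to the Jadad questions for one paper (hypothetical structure)."""
    randomized: bool
    randomization_appropriate: Optional[bool]  # None means method not described
    double_blind: bool
    blinding_appropriate: Optional[bool]       # None means method not described
    withdrawals_described: bool


def jadad_score(report):
    """Return the Jadad score, an integer from 0 to 5."""
    score = 0
    if report.randomized:
        score += 1
        if report.randomization_appropriate is True:
            score += 1   # method described and appropriate
        elif report.randomization_appropriate is False:
            score -= 1   # method described but inappropriate
    if report.double_blind:
        score += 1
        if report.blinding_appropriate is True:
            score += 1
        elif report.blinding_appropriate is False:
            score -= 1
    if report.withdrawals_described:
        score += 1
    return score


example = TrialReport(
    randomized=True,
    randomization_appropriate=True,
    double_blind=True,
    blinding_appropriate=None,
    withdrawals_described=True,
)
print(jadad_score(example))
```

A cut-off on this score (which value to use is a judgment call) can then decide whether a trial stays in the dataset.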

After some iterations, we selected Chain of Thought Prompting (18) as the best method to implement our study design evaluation. Chain of Thought Prompting forces the LLM through a series of intermediate reasoning steps designed by us. This is a way to inject human logic into our framework, and it significantly improves the LLM’s reasoning ability. The effect on performance was striking: we were able to raise the relevance of our dataset to 90%. Through these methods, we have addressed some of the uncertainties surrounding LLMs.
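The sketch below shows one way a prompt with designer-specified intermediate steps could be built for a study design evaluation. The four steps and the verdict format are hypothetical examples written for this illustration; they are not the prompts we use in production.

```python
# Hypothetical reasoning steps, loosely mirroring the criteria discussed above.
STEPS = [
    "State the study design (for example randomized controlled trial, cohort, case report).",
    "Say whether randomization and blinding are described, and how.",
    "Report the sample size and whether withdrawals and dropouts are accounted for.",
    "List the endpoints and any potential biases or confounding factors.",
]


def build_cot_prompt(abstract, steps=STEPS):
    """Build a prompt that walks the model through the steps in a fixed order."""
    numbered = "\n".join(f"Step {i}: {step}" for i, step in enumerate(steps, start=1))
    return (
        "You are assessing the design of a clinical study.\n"
        "Work through the steps in order and answer each one before moving on.\n\n"
        f"{numbered}\n\n"
        f"Abstract:\n{abstract}\n\n"
        "Finish with a line of the form 'VERDICT: include' or 'VERDICT: exclude'."
    )


cot_prompt = build_cot_prompt("Placeholder abstract text.")
print(cot_prompt)
```

Asking for a fixed final line makes the model's decision easy to parse, while the numbered answers above it remain available for human review.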

          2. User Generated Content

Research on complementary medicine, however, remains limited. A systematic review shows that this is due to (i) the paucity of funding available (19) and (ii) the difficulty of protocolizing complementary medicine. While the former is being remedied by initiatives like ours, the latter is inherent and unlikely to change (though this is not necessarily a flaw) (20). Luckily, scientific research is not the only data source available. A common feature of the internet is that users can engage in forums or leave comments. By offering these features, large platforms in health and wellness have aggregated millions of pieces of user generated content.

                  2.1 Crowdsourcing

We intend for this user generated content to pick up where scientific research falls short by drawing on the law of large numbers (21). The law of large numbers is especially suited to complex fields such as complementary medicine that cannot be solved by specifiable rules (22). By leveraging this probabilistic law and the vastness of user generated content, we plan to enrich our dataset with additional facts pertaining to treatment effectiveness, side effects and interactions. We also aim to build the first knowledge graph to speed up content discovery within our dataset.
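To illustrate why volume matters, the sketch below estimates the share of positive user reports for a treatment together with its standard error. It assumes each report has already been reduced to a yes/no outcome and that reports are independent, which real forum data may well violate; the figures are made up.

```python
import math


def positive_share(reports):
    """Return (share of positive reports, standard error of that share).

    `reports` is a non-empty list of booleans: True if the user reported
    that the treatment helped, False otherwise.
    """
    n = len(reports)
    if n == 0:
        raise ValueError("at least one report is required")
    p = sum(reports) / n
    standard_error = math.sqrt(p * (1 - p) / n)
    return p, standard_error


# Same underlying share (60%), but 100 times more reports in the second case.
p_small, se_small = positive_share([True] * 6 + [False] * 4)
p_large, se_large = positive_share([True] * 600 + [False] * 400)
print(p_small, se_small)
print(p_large, se_large)
```

With 100 times more reports the standard error shrinks by a factor of 10, which is the practical benefit the law of large numbers offers here, provided the reports are not systematically biased.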


In the first portion of the project, we will summarize user generated content using a generative feedback loop. We will then compare the end result against our existing dataset built from scientific research, which will serve as a control group.

                  2.2 Human in the loop

To further improve the results provided by the law of large numbers, we will likely have to focus our effort on evaluating incentives and diversity within our initial sources of user generated content. We plan to apply a standardized measure of diversity, specifically the one developed by Scott Page (23). This measure ensures that our investigations encompass adequate cognitive differences in perspectives, interpretations, heuristics and predictive models.
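One candidate for such a measure is the prediction diversity term of Page's Diversity Prediction Theorem, which states that the crowd's squared error equals the average individual squared error minus the diversity of the predictions. The sketch below computes the three quantities; that this is the measure we will end up using is an assumption, and the numbers are made up.

```python
def diversity_decomposition(predictions, truth):
    """Return (collective error, average individual error, prediction diversity).

    All three are squared-error quantities; the crowd prediction is the mean
    of the individual predictions.
    """
    n = len(predictions)
    if n == 0:
        raise ValueError("at least one prediction is required")
    crowd = sum(predictions) / n
    collective_error = (crowd - truth) ** 2
    average_error = sum((p - truth) ** 2 for p in predictions) / n
    diversity = sum((p - crowd) ** 2 for p in predictions) / n
    return collective_error, average_error, diversity


# Three hypothetical sources estimating a treatment's success rate (true value 0.6).
collective, average, diversity = diversity_decomposition([0.2, 0.5, 0.8], truth=0.6)
print(collective, average, diversity)
```

The identity `collective = average - diversity` holds for any set of predictions, so for a given level of individual accuracy, more diverse sources yield a more accurate aggregate.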


Once again we intend to use Chain of Thought Prompting as a method (18) to implement our incentives and diversity evaluation. We will monitor any improvements by using the same control group of results obtained through scientific research.


          3. Unali Health


To further support our cleaning activity, we built Unali Health. Unali Health is a search engine with a user-friendly interface intended for health and wellness professionals. The search engine monitors professionals’ behavior and collects their feedback through a series of short forms. Unali Health serves as training wheels for our frameworks and dataset.


                   3.1 Backtracking

A recent study by the Google AI research team shows that the best model achieved only 52.9% accuracy in identifying logical mistakes in its own Chain of Thought. However, when told where the error was, the LLM was able to generate accuracy gains of up to 40% (25). In their current form, LLMs cannot find reasoning errors, but they can correct them (24). More importantly, once corrected, mistake finding can be generalized to the whole dataset (24). Unali Health is built to leverage this opportunity by asking users to evaluate and identify mistakes across its rating systems.
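The sketch below shows the core of backtracking in simplified form: the reasoning steps before the flagged mistake are kept, and everything from the mistake onward is regenerated. The `regenerate` callable is a stand-in for an LLM call and the arithmetic trace is a toy example; this is loosely inspired by the cited work, not a reproduction of it.

```python
def backtrack(steps, mistake_index, regenerate):
    """Keep the steps before the flagged mistake and regenerate the rest.

    `steps` is the original chain of thought as a list of strings,
    `mistake_index` is the position of the first wrong step (for example
    as flagged by a user), and `regenerate` maps the kept prefix to a list
    of new steps.
    """
    if not 0 <= mistake_index < len(steps):
        raise IndexError("mistake_index must point at an existing step")
    prefix = steps[:mistake_index]
    return prefix + regenerate(prefix)


def stub_regenerate(prefix):
    """Stand-in for an LLM call; returns a hard-coded continuation."""
    return ["36 + 4 = 40", "Answer: 40"]


trace = ["12 * 3 = 36", "36 + 4 = 41", "Answer: 41"]
corrected = backtrack(trace, mistake_index=1, regenerate=stub_regenerate)
print(corrected)
```

In an application such as Unali Health, the mistake location would come from professionals' feedback rather than from the model itself.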

                   3.2 Knowledge graph indexing

We are processing a large amount of data: 200 million research papers and 8 million pieces of user generated content. We expect both numbers to keep growing. Providing a satisfying search experience, one that is both accurate and fast, is an uphill battle with such a large dataset.

Unali Health is built to monitor users’ interactions with all the conditions, symptoms and treatments displayed via its interface. This allows us to better understand relations between each entity and organize them into a knowledge graph.

Adding a knowledge graph could provide the LLMs we use with more contextual information. This would greatly improve speed and overall reasoning abilities through graph traversal (26).
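As a minimal sketch of graph traversal over such entities, the code below runs a breadth-first search from one node and returns everything reachable within a given number of hops. The graph is a toy adjacency list with made-up edges that carry no medical meaning; a real knowledge graph would also store typed relations.

```python
from collections import deque

# Toy graph with illustrative edges only (not medical claims).
GRAPH = {
    "insomnia": ["valerian", "meditation"],
    "valerian": ["drowsiness"],
    "meditation": ["anxiety"],
    "anxiety": ["meditation", "lavender"],
    "lavender": [],
    "drowsiness": [],
}


def neighbors_within(graph, start, max_hops):
    """Breadth-first search returning {entity: hop distance} within max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, []):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    del distance[start]
    return distance


related = neighbors_within(GRAPH, "insomnia", max_hops=2)
print(related)
```

The entities returned this way could be added to an LLM prompt as context, so that the model sees related symptoms and treatments without scanning the whole dataset.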

          4. Curation


In parallel to growing and cleaning our dataset, we will curate it in ways that make it actionable for AI. This goes beyond the various cleaning activities mentioned above and includes activities such as chunking, vectorizing and indexing (27).

This will require us to evaluate the various chunking techniques, embedding models and indexing algorithms to optimize downstream computation and output quality. We anticipate activities such as chunking and indexing to be straightforward. A task such as vectorization might require getting involved with the open source community focusing on building transformers specialized in health. 
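The sketch below walks through the three activities on a toy scale: fixed-size chunking with overlap, a bag-of-words vector per chunk, and a linear-scan index queried by cosine similarity. The bag-of-words vectors stand in for a real embedding model and the linear scan stands in for a real indexing algorithm; all names, sizes and sample texts are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=40, overlap=10):
    """Split text into chunks of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text):
    """Toy vectorizer: word counts (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Store each chunk next to its vector (a stand-in for a vector index)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def search(index, query, k=1):
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


pieces = chunk_words("a b c d e f g h i j", size=4, overlap=1)
index = build_index(["valerian sleep trial", "ginger nausea trial"])
print(pieces)
print(search(index, "nausea"))
```

Chunk size, overlap, the embedding model and the index type are exactly the choices that need to be evaluated against downstream answer quality.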




  1. Holman, Halsted R. “The Relation of the Chronic Disease Epidemic to the Health Care Crisis.” NCBI, 19 February 2020.

  2. The Harris Poll. “The Patient Experience: Perspectives on Today's Healthcare.” AAPA, 9 November 2017.

  3. Ekor, Martins. “The growing use of herbal medicines: issues relating to adverse reactions and challenges in monitoring safety.” NCBI, 10 January 2014.

  4. Ventola, C. Lee. “Current Issues Regarding Complementary and Alternative Medicine (CAM) in the United States.” NCBI, August 2010.

  5. Holman, Halsted R. “The Relation of the Chronic Disease Epidemic to the Health Care Crisis.” NCBI, 19 February 2020.

  6. Zhu, Zhaocheng. “Solving Reasoning Problems with LLMs in 2023.” Towards Data Science, 6 January 2024.

  7. Sharma, Akhilesh Kumar, et al. “Chapter 11 - Probabilistic approaches for minimizing the healthcare diagnosis cost through data-centric operations.” Science Direct, 13 November 2023.

  8. Mauboussin, Michael. “Mauboussin Process V Outcome.” Legg Mason, 24 May 2004.

  9. “I disagree with Geoff Hinton regarding glorified autocomplete.” Hacker News, 19 November 2023.

  10. Roon. “Text is the universal interface.” Scale AI, 8 September 2022.

  11. Hotz, Heiko. “RAG vs Finetuning — Which Is the Best Tool to Boost Your LLM Application?” Towards Data Science, 24 August 2023.

  12. Kristiansen, Nina. “The amount of research in the last seven years has grown by 50 per cent. Are important findings being overlooked?” Science Norway, 13 December 2023.

  13. “China's fake science industry: how 'paper mills' threaten progress.” Financial Times, 27 March 2023.

  14. Allen Institute for AI. “Semantic Scholar Releases New Recommendations API | AI2 Blog.” AI2, 29 June 2022.

  15. Kinney, Rodney, et al. “The Semantic Scholar Open Data Platform.” arXiv, 24 January 2023.

  16. Evidence-based Obstetric Anesthesia. “Jadad scale for reporting randomized controlled trials.” 13 November 2023.

  17. “Evidence-Based Review for Scientific Evaluation of Health Claims.” FDA, Center for Food Safety and Applied Nutrition, 17 September 2018.

  18. Wei, Jason, et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” arXiv, 28 January 2022.

  19. Veziari, Yasamin, et al. “Barriers to the conduct and application of research in complementary and alternative medicine: a systematic review.” NCBI, 23 March 2017.

  20. Berg, M. “Problems and promises of the protocol.” PubMed, April 1997.

  21. Zhang, Jing. “Knowledge Learning with Crowdsourcing: A Brief Review and Systematic Perspective.” arXiv, Cornell University, 19 June 2022.

  22. Mauboussin, Michael J. “Explaining the Wisdom of Crowds.” InnovationLabs, 20 March 2007.

  23. Ioannides, Yannis M. “A Review of Scott E. Page's The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies.” ResearchGate, March 2010.

  24. Tyen, Gladys, et al. “LLMs cannot find reasoning errors, but can correct them!” arXiv, Cornell University, 14 November 2023.

  25. Google. “Can large language models identify and correct their mistakes?” Google Research, 11 January 2024.

  26. Alcaraz, Anthony. “Embeddings + Knowledge Graphs: The Ultimate Tools for RAG Systems.” Towards Data Science, 14 November 2023.

  27. Monigatti, Leonie. “A Guide on 12 Tuning Strategies for Production-Ready RAG Applications.” Towards Data Science, 5 December 2023.
