Source Code

The sections below describe the core functionality of Scholaris, including helper functions for JSON schema generation, local file processing, external data retrieval from sources like NCBI, Semantic Scholar and OpenAlex, and an Assistant class for simplified chat and tool use. Follow this link to view the full source code.


Note

If you are reading this as part of the documentation pages, you can view a Jupyter notebook with the ‘literate’ source code and additional tests here. Scholaris was built to run with Llama 3.1 8B using the Ollama framework.

Helper functions

The Ollama framework supports tool calling (also referred to as function calling) with models such as Llama 3.1. To leverage function calling, we need to pass the JSON schema of each function as an argument to the LLM. Based on this schema, the LLM infers the most appropriate tool for a given prompt and the arguments to pass to it. To simplify the process of generating JSON schemas, use the helper and decorator functions defined below.


source

generate_json_schema

 generate_json_schema (func:Callable)

*Generate a JSON schema for the given function based on its annotations and docstring.

Args: func (Callable): The function to generate a schema for.

Returns: Dict[str, Any]: A JSON schema for the function.*
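As a rough illustration of the idea (not the Scholaris implementation), the sketch below builds a tool schema from a function's type hints and the first line of its docstring. The name `sketch_json_schema`, the type mapping, the exact schema layout, and the example function `add` are assumptions made for this example.

```python
import inspect
from typing import Any, Callable, Dict, get_type_hints

# Assumed mapping from Python types to JSON schema types (illustrative only).
_TYPE_MAP = {str: "string", int: "integer", float: "number", bool: "boolean"}


def sketch_json_schema(func: Callable) -> Dict[str, Any]:
    """Build a tool schema from a function's annotations and docstring."""
    hints = get_type_hints(func)
    hints.pop("return", None)  # the return type is not a parameter
    signature = inspect.signature(func)

    properties = {}
    required = []
    for name, param in signature.parameters.items():
        # Unknown or missing annotations fall back to "string".
        json_type = _TYPE_MAP.get(hints.get(name), "string")
        properties[name] = {"type": json_type}
        # Parameters without a default value must be supplied by the LLM.
        if param.default is inspect.Parameter.empty:
            required.append(name)

    doc = inspect.getdoc(func) or ""
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": doc.split("\n")[0],
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def add(a: int, b: int = 0) -> int:
    """Add two integers."""
    return a + b


schema = sketch_json_schema(add)
print(schema["function"]["parameters"])
```

Only `a` ends up in the `required` list here, because `b` has a default value.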


source

json_schema_decorator

 json_schema_decorator (func:~T)

*Decorator to generate and attach a JSON schema to a function.

Args: func (Callable): The function to decorate.

Returns: Callable: The decorated function with an attached JSON schema.*
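The decorator pattern described above can be sketched as follows. This is a minimal stand-in, not the actual `json_schema_decorator`: the attribute name `json_schema`, the reduced schema contents, and the example function `greet` are hypothetical.

```python
from typing import Any, Callable, Dict, TypeVar

T = TypeVar("T", bound=Callable[..., Any])


def sketch_schema_decorator(func: T) -> T:
    """Attach a minimal schema to the function and return it unchanged."""
    schema: Dict[str, Any] = {
        "name": func.__name__,
        "description": (func.__doc__ or "").strip(),
    }
    # Hypothetical attribute name; the real decorator may store it differently.
    func.json_schema = schema  # type: ignore[attr-defined]
    return func


@sketch_schema_decorator
def greet(name: str) -> str:
    """Return a greeting for the given name."""
    return f"Hello, {name}!"


print(greet("Ada"))
print(greet.json_schema["name"])
```

Because the decorator returns the original function, the decorated function still behaves as before and simply carries its schema along.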

Local file processing: listing, content extraction, and summarization

Note

Below are the core functions the assistant can call. With these functions, the assistant can list the file names in a specific data directory, extract content from these files, and summarize them.


source

get_file_names

 get_file_names (ext:str='pdf, txt')

*Retrieves a list of file names with specified extensions in a local data directory the assistant has access to on the user’s computer.

Args: ext: A comma-separated string of file extensions to filter the files by. Options are: pdf, txt, md, markdown, csv, and py. Defaults to “pdf, txt”.

Returns: str: A comma-separated string of file names with the specified extensions. If no files are found, a message is returned.

Example: >>> get_file_names(ext="pdf, txt")

"List of file names with the specified extensions in the local data directory: file1.pdf, file2.txt"*

Note

The next two functions are executed when calling the tool get_titles_and_first_authors, which is defined below, to extract the title and first author from PDF files in the local data directory. The first function, extract_text_from_pdf, extracts text from a PDF file using the PyPDF2 library. The second function, extract_title_and_first_author, will then use the LLM to extract the title and first author from the text extracted from the PDF file.

In addition, the extract_text_from_pdf function can also be called directly by the assistant to extract user-specified pages or sections from a PDF file, to respond to user queries or to extract specific information from a PDF file specified by the user.


source

extract_text_from_pdf

 extract_text_from_pdf (file_name:str, page_range:Optional[str]=None)

*A function that extracts text from a PDF file. Use this tool to extract specific details from a PDF document, such as the abstract, authors, conclusions, or user-specified content of other sections. If the user specifies a page range, use the optional page_range parameter to extract text from specific pages. If the user uses words such as beginning, middle, or end to describe the section, infer the page range based on an assumed total of 15 pages per document. Do not use this tool to summarize an entire PDF document. Only use this tool for documents with the extension .pdf or .PDF.

Args:
file_name (str): The file name of the PDF document in the local data directory.
page_range (Optional[str]): A string with page numbers and/or page ranges separated by commas (e.g., “1”, “1, 2”, or “5-7”). Default is None to extract all pages.

Returns: str: Extracted text from the PDF.

Example: >>> text = extract_text_from_pdf("./test.pdf", page_range="1")*
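To make the `page_range` format concrete, here is a minimal sketch of how a string such as “1, 2, 5-7” could be turned into zero-based page indices. The helper `parse_page_range` and its `total_pages` argument are hypothetical and are not taken from the Scholaris source.

```python
from typing import List, Optional


def parse_page_range(page_range: Optional[str], total_pages: int) -> List[int]:
    """Convert a string such as "1, 2, 5-7" into zero-based page indices."""
    if page_range is None:
        return list(range(total_pages))  # no range given: all pages

    indices = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Page numbers are 1-based; keep only pages that exist.
        for page in range(start, end + 1):
            if 1 <= page <= total_pages:
                indices.add(page - 1)
    return sorted(indices)


print(parse_page_range("1, 2, 5-7", total_pages=15))  # prints [0, 1, 4, 5, 6]
```

Pages beyond the end of the document are silently dropped in this sketch; raising an error instead would be an equally reasonable choice.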


source

extract_title_and_first_author

 extract_title_and_first_author (contents:List[Dict[str,str]],
                                 model:str='llama3.1',
                                 verbose:Optional[bool]=False,
                                 show_progress:Optional[bool]=False)

*A function that extracts the titles and first authors’ names from the text of one or more research articles.

Args:
contents (List[Dict[str, str]]): A list of dictionaries containing the file name and extracted text.
model (str): The model to use for the extraction. Default is ‘llama3.1’.
verbose (Optional[bool]): Whether to print additional information. Default is False.
show_progress (Optional[bool]): Whether to show a progress bar. Default is False.

Returns: contents (List[Dict[str, str]]): The input list of dictionaries with the extracted title and first author added.

Raises: JSONDecodeError: If the JSON response is invalid.

Example: >>> contents = extract_title_and_first_author(contents)

Extracting titles and first authors: 100%|██████████| 3/3 [00:22<00:00, 7.35s/it]*
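The `JSONDecodeError` mentioned above arises when the LLM reply is not valid JSON. The sketch below shows one way such a reply might be parsed and merged into a record. The helper name, the keys `title` and `first_author`, and the placeholder reply are assumptions for illustration only.

```python
import json
from typing import Dict


def parse_title_and_author(raw_response: str) -> Dict[str, str]:
    """Parse an LLM reply that is expected to be a JSON object."""
    data = json.loads(raw_response)  # raises json.JSONDecodeError if invalid
    return {
        "title": str(data.get("title", "")),
        "first_author": str(data.get("first_author", "")),
    }


# Placeholder reply standing in for real model output.
reply = '{"title": "An Example Article", "first_author": "Jane Doe"}'
record = {"file_name": "example.pdf", **parse_title_and_author(reply)}
print(record)

try:
    parse_title_and_author("not valid json")
except json.JSONDecodeError as err:
    print(f"Invalid JSON response: {err.msg}")
```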

Note

The next function combines the two functions extract_text_from_pdf and extract_title_and_first_author to extract the title and first author from PDF files in the local data directory. This function will be callable by the LLM.


source

get_titles_and_first_authors

 get_titles_and_first_authors ()

*A function that retrieves the titles and first authors of research articles from a directory of PDF files.

Returns: str: A JSON-formatted string containing the titles, first authors and file names of the research articles.

Raises: FileNotFoundError: If the specified directory does not exist.

Example: >>> get_titles_and_first_authors()*

Note

The functions below are executed to summarize the content of files in the local data directory.


source

summarize_local_document

 summarize_local_document (file_name:str, ext:str='pdf')

*Summarize the content of a single PDF, markdown, or text document from the local data directory.

Args:
file_name (str): The file name of the local document to summarize.
ext (str): The extension of the local document. Options are: pdf, txt, md, and markdown. Defaults to “pdf”.

Returns: str: The summary of the content of the local document.

Example: >>> summarize_local_document("research_paper", ext="pdf")*


source

describe_python_code

 describe_python_code (file_name:str)

*Describe the purpose of the Python code in a local Python file. This may involve summarizing the entire code, extracting key functions, or providing an overview of the code structure.

Args: file_name (str): The file name of the local Python file to describe.

Returns: str: A description of the purpose of the Python code in the local file.

Example: >>> describe_python_code("main.py")*

External data retrieval from NCBI, OpenAlex and Semantic Scholar

Note

The following functions are used to convert article IDs between different formats and detect the type of an article ID based on its format.

  • convert_id: Converts article IDs between PubMed Central, PubMed, DOI, and manuscript ID formats using the NCBI ID Converter API.

  • detect_id_type: Analyzes a given string to determine if it’s a PMID, PMCID, DOI, OpenAlex ID, Semantic Scholar ID, potential article title, or an unknown format.

  • id_converter_tool: Combines the functionality of the above two functions to process a list of IDs, detecting their types and converting them using the NCBI API. This is callable by the LLM assistant.


source

id_converter_tool

 id_converter_tool (ids:List[str])

*For any article(s) in PubMed Central, find all the corresponding PubMed IDs (PMIDs), digital object identifiers (DOIs), and manuscript IDs (MIDs). Use this tool to convert a list of IDs, such as PMIDs, PMCIDs, or DOIs, and find the corresponding IDs for the same articles.

Args: ids (str): A string with a comma-separated list of IDs to convert. Must be PMIDs, PMCIDs, or DOIs. The maximum number of IDs per request is 200.

Returns: str: A JSON-formatted string containing the conversion results and the detected ID types.*


source

detect_id_type

 detect_id_type (id_string:str)

*Detect the type of the given ID or title.

Args: id_string (str): The ID or title to detect.

Returns: str: The detected type (‘pmid’, ‘pmcid’, ‘doi’, ‘openalex’, ‘semantic_scholar’, ‘potential_title’, or ‘unknown’).*
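A detection step like this can be approximated with a handful of regular expressions. The patterns below are simplified assumptions chosen for illustration, not the rules used by `detect_id_type`, and the function name `sketch_detect_id_type` is hypothetical.

```python
import re


def sketch_detect_id_type(id_string: str) -> str:
    """Classify an identifier using simple, assumed regular expressions."""
    text = id_string.strip()
    if re.fullmatch(r"PMC\d+", text, flags=re.IGNORECASE):
        return "pmcid"
    if re.fullmatch(r"\d{1,8}", text):
        return "pmid"
    if re.fullmatch(r"10\.\d{4,9}/\S+", text):
        return "doi"
    if re.fullmatch(r"W\d+", text):
        return "openalex"
    if re.fullmatch(r"[0-9a-f]{40}", text):
        return "semantic_scholar"
    # Several whitespace-separated words are treated as a possible title.
    if len(text.split()) >= 3:
        return "potential_title"
    return "unknown"


for candidate in ["PMC1234567", "12345678", "10.1000/xyz123", "???"]:
    print(candidate, "->", sketch_detect_id_type(candidate))
```

The order of the checks matters: the PMCID pattern is tested before the purely numeric PMID pattern so that the more specific prefix wins.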


source

convert_id

 convert_id (ids:List[str])

*For any article(s) in PubMed Central, find all the corresponding PubMed IDs (PMIDs), digital object identifiers (DOIs), and manuscript IDs (MIDs).

Args: ids (List[str]): A list of IDs to convert (max 200 per request).

Returns: str: A JSON-formatted string containing the conversion results.*
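The limit of 200 IDs per request suggests batching longer lists. The sketch below only builds request URLs and makes no network calls. The endpoint URL and the `ids` and `format` query parameters are assumptions that should be checked against the current NCBI ID Converter documentation, and the helper names and placeholder IDs are hypothetical.

```python
from typing import Dict, Iterator, List
from urllib.parse import urlencode

# Assumed endpoint; check the NCBI documentation for the current URL.
IDCONV_URL = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
MAX_IDS_PER_REQUEST = 200  # limit stated in the docstring above


def batch_ids(ids: List[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_idconv_urls(ids: List[str]) -> List[str]:
    """Build one request URL per batch of IDs."""
    urls = []
    for batch in batch_ids(ids):
        params: Dict[str, str] = {"ids": ",".join(batch), "format": "json"}
        urls.append(f"{IDCONV_URL}?{urlencode(params)}")
    return urls


# Placeholder identifiers, used only to show the URL shape.
urls = build_idconv_urls(["PMC1234567", "12345678"])
print(urls[0])
```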

Note

The function query_openalex_api below is executed to query the OpenAlex database for additional information about a given article, either by title, PubMed ID, PMC ID, or DOI.


source

query_openalex_api

 query_openalex_api (query_param:str)

*Retrieve metadata for a given article from OpenAlex, a comprehensive open-access catalog of global research papers. Use this tool to search the OpenAlex API by using the article title, the PubMed ID (PMID), the PubMed Central ID (PMCID) or the digital object identifier (DOI) of an article as the query parameter. This tool returns the following metadata:

- the OpenAlex ID
- the digital object identifier (DOI) URL
- the citation count
- the open access status
- the URL to the open-access location for the work
- the publication year
- a URL to a website listing works that have cited the article
- the type of the article

Use this tool only if an article title, PubMed ID or DOI is provided by the user or was extracted from a local PDF file and is present in the conversation history.

Args: query_param (str): The article title, the PubMed ID (PMID), the PubMed Central ID (PMCID) or the digital object identifier (DOI) of the article to retrieve metadata for. May be provided by the user or extracted from a local PDF file and present in the conversation history.

Returns: str: A JSON-formatted string including the search results from the OpenAlex database. If no results are found or the API query fails, an appropriate message is returned.*
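One way to route the different kinds of query parameter is sketched below. It only constructs URLs and sends no requests. The `doi:`, `pmid:`, and `pmcid:` path prefixes and the `title.search` filter reflect my understanding of the OpenAlex API and should be treated as assumptions to verify against its documentation. The function name and the explicit `id_type` argument are hypothetical.

```python
from urllib.parse import quote

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_openalex_url(query_param: str, id_type: str) -> str:
    """Build a works URL for an identifier, or a title search otherwise."""
    if id_type in ("doi", "pmid", "pmcid"):
        # Assumed namespace-prefix form for external identifiers.
        return f"{OPENALEX_WORKS_URL}/{id_type}:{query_param}"
    # Fall back to a title search; the filter name is an assumption.
    return f"{OPENALEX_WORKS_URL}?filter=title.search:{quote(query_param)}"


print(build_openalex_url("10.1000/xyz123", "doi"))
print(build_openalex_url("deep learning", "potential_title"))
```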

Note

The function query_semantic_scholar_api() below is executed to query the Semantic Scholar database for additional information about a given article, either by title, PubMed ID, or DOI. To increase the rate limit, provide your own API key (see the documentation for more information).


source

query_semantic_scholar_api

 query_semantic_scholar_api (query_param:str)

*Retrieve metadata for a given article from the Semantic Scholar Academic Graph (S2AG), a large knowledge graph of scientific literature that combines data from multiple sources. Use this tool to query the Semantic Scholar Graph API by using either the article title, the PubMed ID, or the digital object identifier (DOI) to retrieve the following metadata:

- the title
- the publication year
- the abstract
- a tldr (too long, didn’t read) summary
- the authors of the article
- the URL to the open-access PDF version of the article, if available
- the journal name
- a URL to the article on the Semantic Scholar website

Use this tool only if an article title, PubMed ID or DOI is provided by the user or was extracted from a local PDF file and is present in the conversation history.

Args: query_param (str): The article title, the PubMed ID, or the digital object identifier of the article to retrieve metadata for. May be provided by the user or extracted from a local PDF file and present in the conversation history. Do not include the ‘https://doi.org/’ prefix for DOIs, or keys such as ‘DOI’, ‘PMCID’ or ‘PMID’. The tool will automatically detect the type of identifier provided.

Returns: str: A JSON-formatted string including the search results from the Semantic Scholar database. If no results are found or the API query fails, an appropriate message is returned.*
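The request described above might be assembled roughly as follows. This sketch builds the URL and headers without contacting the API. The endpoint paths, the `DOI:` and `PMID:` prefixes, the field names, and the `x-api-key` header follow my reading of the Semantic Scholar Graph API and are assumptions to confirm in its documentation. The function name and `id_type` argument are hypothetical.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import quote, urlencode

S2_PAPER_URL = "https://api.semanticscholar.org/graph/v1/paper"
# Assumed field names matching the metadata listed in the docstring.
FIELDS = "title,year,abstract,tldr,authors,openAccessPdf,journal,url"


def build_s2_request(
    query_param: str, id_type: str, api_key: Optional[str] = None
) -> Tuple[str, Dict[str, str]]:
    """Return the request URL and headers for a paper lookup or search."""
    # An API key, if provided, is sent as a header to raise the rate limit.
    headers = {"x-api-key": api_key} if api_key else {}
    if id_type == "doi":
        url = f"{S2_PAPER_URL}/DOI:{quote(query_param, safe='/')}?fields={FIELDS}"
    elif id_type == "pmid":
        url = f"{S2_PAPER_URL}/PMID:{query_param}?fields={FIELDS}"
    else:
        # Treat anything else as a title and use the search endpoint.
        params = urlencode({"query": query_param, "fields": FIELDS, "limit": 1})
        url = f"{S2_PAPER_URL}/search?{params}"
    return url, headers


url, headers = build_s2_request("10.1000/xyz123", "doi", api_key="my-key")
print(url)
print(headers)
```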


source

respond_to_generic_queries

 respond_to_generic_queries ()

*A function to respond to generic questions or queries from the user. Use this tool if no better tool is available.

This tool does not take any arguments.

Returns: str: A response to a generic question.*

Assistant class

Note

The Assistant class below is defined to simplify the process of chat and tool use.


source

Assistant

 Assistant (status:dict={}, sys_message:str=None,
            model:str='llama3.1:latest',
            tools:Dict[str,Any]={'get_file_names': <function
            get_file_names at 0x7f16da568af0>, 'extract_text_from_pdf':
            <function extract_text_from_pdf at 0x7f16dac09f30>,
            'get_titles_and_first_authors': <function
            get_titles_and_first_authors at 0x7f16da569f30>,
            'summarize_local_document': <function summarize_local_document
            at 0x7f16da56ac20>, 'describe_python_code': <function
            describe_python_code at 0x7f16da56add0>, 'id_converter_tool':
            <function id_converter_tool at 0x7f16da56b9a0>,
            'query_openalex_api': <function query_openalex_api at
            0x7f16da56beb0>, 'query_semantic_scholar_api': <function
            query_semantic_scholar_api at 0x7f16d9fc13f0>,
            'respond_to_generic_queries': <function
            respond_to_generic_queries at 0x7f16d9fc1750>},
            add_tools:Dict[str,Any]={},
            authentication:Optional[Dict[str,str]]=None,
            dir_path:str='../data', messages:List[Dict[str,str]]=[])


| Parameter | Type | Default | Details |
|---|---|---|---|
| status | dict | {} | The status of the assistant |
| sys_message | str | None | The system message for the assistant; if not provided, a default message is used |
| model | str | llama3.1:latest | The model to use for the assistant |
| tools | Dict | {'get_file_names': <function get_file_names at 0x7f16da568af0>, 'extract_text_from_pdf': <function extract_text_from_pdf at 0x7f16dac09f30>, 'get_titles_and_first_authors': <function get_titles_and_first_authors at 0x7f16da569f30>, 'summarize_local_document': <function summarize_local_document at 0x7f16da56ac20>, 'describe_python_code': <function describe_python_code at 0x7f16da56add0>, 'id_converter_tool': <function id_converter_tool at 0x7f16da56b9a0>, 'query_openalex_api': <function query_openalex_api at 0x7f16da56beb0>, 'query_semantic_scholar_api': <function query_semantic_scholar_api at 0x7f16d9fc13f0>, 'respond_to_generic_queries': <function respond_to_generic_queries at 0x7f16d9fc1750>} | The tools available to the assistant |
| add_tools | Dict | {} | Optional argument to add additional tools to the assistant, when initializing |
| authentication | Optional | None | Authentication credentials for API calls to external services |
| dir_path | str | ../data | The directory path to which the assistant has access on the local computer |
| messages | List | [] | The conversation history |

source

Assistant.chat

 Assistant.chat (prompt:str, show_progress:bool=False,
                 stream_response:bool=True, redirect_output:bool=False)

*Start a conversation with the AI assistant.

Args:
prompt (str): The user’s prompt or question.
show_progress (bool): Whether to show the step-by-step progress of the function calls, including the tool calls and tool outputs. Default is False.
stream_response (bool): Whether to stream the final response from the LLM. Default is True. Automatically set to True if redirect_output is True.
redirect_output (bool): Whether to redirect the output to be compatible with st.write_stream. Default is False.

Returns: str: The AI assistant’s response.*
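The tool-use step inside a chat turn can be pictured as a lookup in the `tools` dictionary followed by a call with the arguments chosen by the LLM. The sketch below is a simplified, assumed version of that step and not the code of `Assistant.chat`. It assumes tool calls arrive as dictionaries with a `function` entry holding `name` and `arguments`, and the tool `get_file_count` is a hypothetical stand-in.

```python
from typing import Any, Callable, Dict, List


def get_file_count(ext: str = "pdf") -> str:
    """Hypothetical tool used only for this example."""
    return f"Found 2 files with extension {ext}."


def dispatch_tool_calls(
    tool_calls: List[Dict[str, Any]], tools: Dict[str, Callable[..., Any]]
) -> List[Dict[str, str]]:
    """Run each requested tool and collect the outputs as tool messages."""
    messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        arguments = call["function"].get("arguments", {})
        func = tools.get(name)
        if func is None:
            # The model asked for a tool that was never registered.
            output = f"Unknown tool: {name}"
        else:
            output = str(func(**arguments))
        messages.append({"role": "tool", "content": output})
    return messages


calls = [
    {"function": {"name": "get_file_count", "arguments": {"ext": "txt"}}},
    {"function": {"name": "missing_tool", "arguments": {}}},
]
results = dispatch_tool_calls(calls, {"get_file_count": get_file_count})
for message in results:
    print(message)
```

Returning a message for unknown tool names, rather than raising, lets the conversation continue and gives the LLM a chance to recover.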


source

Assistant.show_conversation_history

 Assistant.show_conversation_history (show_function_calls:bool=False)

*Display the conversation history.

Args: show_function_calls (bool): Whether to show function calls and returns in the conversation history. Default is False.

Returns: None*


source

Assistant.clear_conversation_history

 Assistant.clear_conversation_history ()

Clear the conversation history.


source

Assistant.pprint_tools

 Assistant.pprint_tools ()

Pretty-print the descriptions of the available tools.


source

Assistant.get_status

 Assistant.get_status ()

Get the status of the assistant initialization.


source

add_to_class

 add_to_class (Class:type)

Register functions as methods in a class that has already been defined.
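A common way to implement this kind of registration is a decorator factory that uses `setattr`. The following is a minimal sketch of that pattern under the assumption that `add_to_class` works similarly. The names `sketch_add_to_class`, `Counter`, and `increment` are hypothetical.

```python
from typing import Callable


def sketch_add_to_class(Class: type) -> Callable:
    """Return a decorator that registers a function as a method of Class."""
    def wrapper(func: Callable) -> Callable:
        # Attach the function under its own name; instances pick it up at once.
        setattr(Class, func.__name__, func)
        return func
    return wrapper


class Counter:
    def __init__(self) -> None:
        self.value = 0


@sketch_add_to_class(Counter)
def increment(self, step: int = 1) -> int:
    """Increase the counter and return the new value."""
    self.value += step
    return self.value


counter = Counter()
counter.increment()
print(counter.increment(step=2))  # prints 3
```

This pattern is convenient in notebooks, where a class can be defined in one cell and extended with further methods in later cells.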