Source Code
If you are reading this as part of the documentation pages, you can view a Jupyter notebook with the ‘literate’ source code and additional tests here. Scholaris was built to run with Llama 3.1 8B using the Ollama framework.
Helper functions
The Ollama framework supports tool calling (also referred to as function calling) with models such as Llama 3.1. To use function calling, we pass the JSON schema of each function as an argument to the LLM. From this schema, the LLM infers the most appropriate tool for a given prompt and which arguments to pass to it. To simplify generating JSON schemas, use the helper and decorator functions defined below.
generate_json_schema
generate_json_schema (func:Callable)
*Generate a JSON schema for the given function based on its annotations and docstring.
Args: func (Callable): The function to generate a schema for.
Returns: Dict[str, Any]: A JSON schema for the function.*
json_schema_decorator
json_schema_decorator (func:~T)
*Decorator to generate and attach a JSON schema to a function.
Args: func (Callable): The function to decorate.
Returns: Callable: The decorated function with an attached JSON schema.*
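As a rough illustration of the idea, the sketch below derives a tool schema from a function's signature and docstring and attaches it with a decorator. It is not the library's implementation: the names `sketch_generate_json_schema`, `sketch_json_schema_decorator`, the `json_schema` attribute, the type mapping, and the example tool `get_word_count` are all assumptions made for this example.

```python
import inspect
from typing import Any, Callable, Dict, get_type_hints

# Hypothetical mapping from Python annotations to JSON schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def sketch_generate_json_schema(func: Callable) -> Dict[str, Any]:
    """Build a tool schema from a function's signature and docstring (illustrative only)."""
    hints = get_type_hints(func)
    properties: Dict[str, Any] = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Fall back to "string" for annotations this sketch does not map.
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        # Parameters without a default value are marked as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def sketch_json_schema_decorator(func: Callable) -> Callable:
    """Attach the generated schema to the function as an attribute."""
    func.json_schema = sketch_generate_json_schema(func)  # attribute name is an assumption
    return func


@sketch_json_schema_decorator
def get_word_count(text: str, min_length: int = 1) -> int:
    """Count the words in a text that have at least min_length characters."""
    return sum(len(word) >= min_length for word in text.split())
```

Because the description is taken from the docstring, a tool's docstring doubles as the instructions the LLM reads when choosing between tools, which is why the docstrings on this page are written as directions to the model.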
Local file processing: listing, content extraction, and summarization
Below are the core functions the assistant can call. With them, the assistant can list the file names in a specific data directory, extract content from these files, and summarize them.
get_file_names
get_file_names (ext:str='pdf, txt')
*Retrieves a list of file names with specified extensions in a local data directory the assistant has access to on the user’s computer.
Args: ext: A comma-separated string of file extensions to filter the files by. Options are: pdf, txt, md, markdown, csv, and py. Defaults to “pdf, txt”.
Returns: str: A comma-separated string of file names with the specified extensions. If no files are found, a message is returned.
Example: >>> get_file_names(ext=“pdf, txt”)
"List of file names with the specified extensions in the local data directory: file1.pdf, file2.txt"*
The next two functions are executed when calling the tool get_titles_and_first_authors, which is defined below, to extract the title and first author from PDF files in the local data directory. The first function, extract_text_from_pdf, extracts text from a PDF file using the PyPDF2 library. The second function, extract_title_and_first_author, then uses the LLM to extract the title and first author from that text.
In addition, the extract_text_from_pdf function can be called directly by the assistant to extract user-specified pages or sections from a PDF file, either to respond to user queries or to extract specific information from a PDF file specified by the user.
extract_text_from_pdf
extract_text_from_pdf (file_name:str, page_range:Optional[str]=None)
*A function that extracts text from a PDF file. Use this tool to extract specific details from a PDF document, such as abstract, authors, conclusions, or user-specified content of other sections. If the user specifies a page range, use the optional page_range parameter to extract text from specific pages. If the user uses words such as beginning, middle, or end to describe the section, infer the page range based on the total number of 15 pages in a document. Do not use this tool to summarize an entire PDF document. Only use this tool for documents with extensions .pdf or .PDF.
Args: file_name (str): The file name of the PDF document in the local data directory. page_range (Optional[str]): A string with page numbers and/or page ranges separated by commas (e.g., “1” or “1, 2”, or “5-7”). Default is None to extract all pages.
Returns: str: Extracted text from the PDF.
Example: >>> text = extract_text_from_pdf(“./test.pdf”, page_range=“1”)*
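The `page_range` format described above ("1", "1, 2", or "5-7") could be turned into page indices along the following lines. This is a minimal sketch under stated assumptions: the helper name `sketch_parse_page_range`, the zero-based output, and the choice to silently drop pages beyond the end of the document are illustrative and not taken from the library.

```python
from typing import List, Optional


def sketch_parse_page_range(page_range: Optional[str], total_pages: int) -> List[int]:
    """Convert a string such as "1, 2, 5-7" into sorted zero-based page indices."""
    if page_range is None:
        return list(range(total_pages))  # None means every page
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "5-7" covers both end points.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Page numbers are one-based for the user; keep only pages that exist.
        for page in range(start, end + 1):
            if 1 <= page <= total_pages:
                pages.add(page - 1)
    return sorted(pages)
```

For a 15-page document, `sketch_parse_page_range("1, 2, 5-7", 15)` selects the first, second, and fifth through seventh pages.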
summarize_local_document
summarize_local_document (file_name:str, ext:str='pdf')
*Summarize the content of a single PDF, markdown, or text document from the local data directory.
Args: file_name (str): The file name of the local document to summarize. ext (str): The extension of the local document. Options are: pdf, txt, md, and markdown. Defaults to “pdf”.
Returns: str: The summary of the content of the local document.
Example: >>> summarize_local_document(“research_paper”, ext=“pdf”)*
describe_python_code
describe_python_code (file_name:str)
*Describe the purpose of the Python code in a local Python file. This may involve summarizing the entire code, extracting key functions, or providing an overview of the code structure.
Args: file_name (str): The file name of the local Python file to describe.
Returns: str: A description of the purpose of the Python code in the local file.
Example: >>> describe_python_code(“main.py”)*
External data retrieval from NCBI, OpenAlex and Semantic Scholar
The following functions convert article IDs between different formats and detect the type of an article ID based on its format.
- convert_id: Converts article IDs between PubMed Central, PubMed, DOI, and manuscript ID formats using the NCBI ID Converter API.
- detect_id_type: Analyzes a given string to determine if it’s a PMID, PMCID, DOI, OpenAlex ID, Semantic Scholar ID, potential article title, or an unknown format.
- id_converter_tool: Combines the functionality of the above two functions to process a list of IDs, detecting their types and converting them using the NCBI API. This is callable by the LLM assistant.
id_converter_tool
id_converter_tool (ids:List[str])
*For any article(s) in PubMed Central, find all the corresponding PubMed IDs (PMIDs), digital object identifiers (DOIs), and manuscript IDs (MIDs). Use this tool to convert a list of IDs, such as PMIDs, PMCIDs, or DOIs, and find the corresponding IDs for the same articles.
Args: ids (str): A string with a comma-separated list of IDs to convert. Must be PMIDs, PMCIDs, or DOIs. The maximum number of IDs per request is 200.
Returns: str: A JSON-formatted string containing the conversion results and the detected ID types.*
detect_id_type
detect_id_type (id_string:str)
*Detect the type of the given ID or title.
Args: id_string (str): The ID or title to detect.
Returns: str: The detected type (‘pmid’, ‘pmcid’, ‘doi’, ‘openalex’, ‘semantic_scholar’, ‘potential_title’, or ‘unknown’).*
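One plausible way to detect these types is to test the string against a short list of regular expressions. The sketch below is only an approximation: the patterns, their order, the three-word threshold for a potential title, and the name `sketch_detect_id_type` are assumptions for illustration, and the library's actual rules may differ.

```python
import re

# Illustrative patterns only; the library's actual rules may differ.
_ID_PATTERNS = [
    ("pmcid", re.compile(r"^PMC\d+$", re.IGNORECASE)),
    ("pmid", re.compile(r"^\d{1,8}$")),
    ("doi", re.compile(r"^10\.\d{4,9}/\S+$")),
    ("openalex", re.compile(r"^W\d+$", re.IGNORECASE)),
    ("semantic_scholar", re.compile(r"^[0-9a-f]{40}$", re.IGNORECASE)),
]


def sketch_detect_id_type(id_string: str) -> str:
    """Return the first matching ID type, 'potential_title', or 'unknown'."""
    candidate = id_string.strip()
    for id_type, pattern in _ID_PATTERNS:
        if pattern.match(candidate):
            return id_type
    # Several words without an ID-like shape are treated as a possible title.
    if len(candidate.split()) >= 3:
        return "potential_title"
    return "unknown"
```

Checking the most specific shapes first keeps the fallback to a potential title as the last resort before 'unknown'.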
convert_id
convert_id (ids:List[str])
*For any article(s) in PubMed Central, find all the corresponding PubMed IDs (PMIDs), digital object identifiers (DOIs), and manuscript IDs (MIDs).
Args: ids (List[str]): A list of IDs to convert (max 200 per request).
Returns: str: A JSON-formatted string containing the conversion results.*
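Since a single request accepts at most 200 IDs, a caller holding a longer list would need to split it first. The sketch below shows one way to do that; the helper `sketch_batch_ids` is hypothetical, and joining each batch into a comma-separated string is an assumption about how a request would be assembled rather than a description of what `convert_id` does internally.

```python
from typing import Iterator, List

MAX_IDS_PER_REQUEST = 200  # limit stated in the docstring above


def sketch_batch_ids(ids: List[str], batch_size: int = MAX_IDS_PER_REQUEST) -> Iterator[str]:
    """Yield comma-separated ID strings, each holding at most batch_size IDs."""
    # Drop empty entries and surrounding whitespace before batching.
    cleaned = [i.strip() for i in ids if i.strip()]
    for start in range(0, len(cleaned), batch_size):
        yield ",".join(cleaned[start:start + batch_size])
```

Each yielded string could then be passed on as one conversion request.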
The function query_openalex_api below is executed to query the OpenAlex database for additional information about a given article, either by title, PubMed ID, PMC ID, or DOI.
query_openalex_api
query_openalex_api (query_param:str)
*Retrieve metadata for a given article from OpenAlex, a comprehensive open-access catalog of global research papers. Use this tool to search the OpenAlex API by using the article title, the PubMed ID (PMID), the PubMed Central ID (PMCID) or the digital object identifier (DOI) of an article as the query parameter. This tool returns the following metadata: - the OpenAlex ID - the digital object identifier (DOI) URL - Citation count - The open access status - URL to the open-access location for the work - Publication year - A URL to a website listing works that have cited the article - The type of the article Use this tool only if an article title, PubMed ID or DOI is provided by the user or was extracted from a local PDF file and is present in the conversation history.
Args: query_param (str): The article title, the PubMed ID (PMID), the PubMed Central ID (PMCID) or the digital object identifier (DOI) of the article to retrieve metadata for. May be provided by the user or extracted from a local PDF file and present in the conversation history.
Returns: str: A JSON-formatted string including the search results from the OpenAlex database. If no results are found or the API query fails, an appropriate message is returned.*
The function query_semantic_scholar_api below is executed to query the Semantic Scholar database for additional information about a given article, either by title, PubMed ID, or DOI. To increase the rate limit, provide your own API key (see the documentation for more information).
query_semantic_scholar_api
query_semantic_scholar_api (query_param:str)
*Retrieve metadata for a given article from the Semantic Scholar Academic Graph (S2AG), a large knowledge graph of scientific literature that combines data from multiple sources. Use this tool to query the Semantic Scholar Graph API by using either the article title, the PubMed ID, or the digital object identifier (DOI) to retrieve the following metadata: - the title - the publication year - the abstract - a tldr (too long, didn’t read) summary - the authors of the article - the URL to the open-access PDF version of the article, if available - the journal name - a URL to the article on the Semantic Scholar website Use this tool only if an article title, PubMed ID or DOI is provided by the user or was extracted from a local PDF file and is present in the conversation history.
Args: query_param (str): The article title, the PubMed ID, or the digital object identifier of the article to retrieve metadata for. May be provided by the user or extracted from a local PDF file and present in the conversation history. Do not include the ‘https://doi.org/’ prefix for DOIs, or keys such as ‘DOI’, ‘PMCID’ or ‘PMID’. The tool will automatically detect the type of identifier provided.
Returns: str: A JSON-formatted string including the search results from the Semantic Scholar database. If no results are found or the API query fails, an appropriate message is returned.*
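The docstring asks the model to omit the 'https://doi.org/' prefix and keys such as 'DOI' or 'PMID'. A caller that cannot rely on this could clean the input defensively before querying. The sketch below is a hypothetical pre-processing step, not something the source says the tool performs; the function name and the patterns are assumptions.

```python
import re

# Matches a leading DOI resolver URL such as "https://doi.org/".
_DOI_URL_PREFIX = re.compile(r"^https?://(dx\.)?doi\.org/", re.IGNORECASE)
# Matches a leading key such as "DOI:", "PMID " or "pmcid=".
_KEY_PREFIX = re.compile(r"^(doi|pmcid|pmid)(\s*[:=]\s*|\s+)", re.IGNORECASE)


def sketch_normalize_query_param(query_param: str) -> str:
    """Strip a DOI URL prefix or an identifier key from the start of the query."""
    cleaned = query_param.strip()
    cleaned = _DOI_URL_PREFIX.sub("", cleaned)
    cleaned = _KEY_PREFIX.sub("", cleaned)
    return cleaned
```

The key pattern requires a separator after the key, so a title that merely begins with the letters "doi" (for example "Doing science openly") is left untouched.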
respond_to_generic_queries
respond_to_generic_queries ()
*A function to respond to generic questions or queries from the user. Use this tool if no better tool is available.
This tool does not take any arguments.
Returns: str: A response to a generic question.*
Assistant class
The Assistant class below is defined to simplify the process of chat and tool use.
Assistant
Assistant (status:dict={}, sys_message:str=None, model:str='llama3.1:latest', tools:Dict[str,Any]={'get_file_names': <function get_file_names at 0x7f16da568af0>, 'extract_text_from_pdf': <function extract_text_from_pdf at 0x7f16dac09f30>, 'get_titles_and_first_authors': <function get_titles_and_first_authors at 0x7f16da569f30>, 'summarize_local_document': <function summarize_local_document at 0x7f16da56ac20>, 'describe_python_code': <function describe_python_code at 0x7f16da56add0>, 'id_converter_tool': <function id_converter_tool at 0x7f16da56b9a0>, 'query_openalex_api': <function query_openalex_api at 0x7f16da56beb0>, 'query_semantic_scholar_api': <function query_semantic_scholar_api at 0x7f16d9fc13f0>, 'respond_to_generic_queries': <function respond_to_generic_queries at 0x7f16d9fc1750>}, add_tools:Dict[str,Any]={}, authentication:Optional[Dict[str,str]]=None, dir_path:str='../data', messages:List[Dict[str,str]]=[])
Initialize self. See help(type(self)) for accurate signature.
Parameter | Type | Default | Details |
---|---|---|---|
status | dict | {} | The status of the assistant |
sys_message | str | None | The system message for the assistant; if not provided, a default message is used |
model | str | llama3.1:latest | The model to use for the assistant |
tools | Dict | {‘get_file_names’: <function get_file_names at 0x7f16da568af0>, ‘extract_text_from_pdf’: <function extract_text_from_pdf at 0x7f16dac09f30>, ‘get_titles_and_first_authors’: <function get_titles_and_first_authors at 0x7f16da569f30>, ‘summarize_local_document’: <function summarize_local_document at 0x7f16da56ac20>, ‘describe_python_code’: <function describe_python_code at 0x7f16da56add0>, ‘id_converter_tool’: <function id_converter_tool at 0x7f16da56b9a0>, ‘query_openalex_api’: <function query_openalex_api at 0x7f16da56beb0>, ‘query_semantic_scholar_api’: <function query_semantic_scholar_api at 0x7f16d9fc13f0>, ‘respond_to_generic_queries’: <function respond_to_generic_queries at 0x7f16d9fc1750>} | The tools available to the assistant |
add_tools | Dict | {} | Optional argument to add additional tools to the assistant, when initializing |
authentication | Optional | None | Authentication credentials for API calls to external services |
dir_path | str | ../data | The directory path to which the assistant has access on the local computer |
messages | List | [] | The conversation history |
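The `tools` and `add_tools` parameters are both dictionaries that map tool names to callables. The sketch below illustrates, under assumptions, how such a mapping can be merged and used to dispatch a tool call requested by the model. How `Assistant` combines and invokes its tools internally is not shown on this page, so `sketch_dispatch`, `get_file_names_stub`, and the shape of the `tool_call` dictionary are hypothetical.

```python
from typing import Any, Callable, Dict


def get_file_names_stub(ext: str = "pdf, txt") -> str:
    """Stand-in for a real tool; returns a fixed string."""
    return f"file1.pdf, file2.txt (filtered by: {ext})"


def sketch_dispatch(tools: Dict[str, Callable[..., str]], tool_call: Dict[str, Any]) -> str:
    """Look up the requested tool by name and call it with the supplied arguments."""
    name = tool_call["name"]
    if name not in tools:
        return f"Unknown tool: {name}"
    return tools[name](**tool_call.get("arguments", {}))


default_tools = {"get_file_names": get_file_names_stub}
extra_tools = {"shout": lambda text: text.upper()}  # plays the role of add_tools
all_tools = {**default_tools, **extra_tools}
```

With this mapping, `sketch_dispatch(all_tools, {"name": "shout", "arguments": {"text": "hi"}})` calls the added tool, while an unrecognized name produces a message instead of raising an error.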
Assistant.chat
Assistant.chat (prompt:str, show_progress:bool=False, stream_response:bool=True, redirect_output:bool=False)
*Start a conversation with the AI assistant.
Args: prompt (str): The user’s prompt or question. show_progress (bool): Whether to show the step-by-step progress of the function calls, including the tool calls and tool outputs. Default is False. stream_response (bool): Whether to stream the final response from the LLM. Default is True. Automatically set to True if redirect_output is True. redirect_output (bool): Whether to redirect the output to be compatible with st.write_stream. Default is False.
Returns: str: The AI assistant’s response.*
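Streamlit's `st.write_stream` consumes a generator or other iterable of text chunks, which is presumably why `redirect_output` forces streaming on. The sketch below shows the general shape of such a generator while also collecting the full response; it is an assumption about the pattern, not the library's code, and `sketch_stream_chunks` is a hypothetical name.

```python
from typing import Iterable, Iterator, List


def sketch_stream_chunks(chunks: Iterable[str], collected: List[str]) -> Iterator[str]:
    """Yield each chunk as it arrives and record it so the full reply can be stored."""
    for chunk in chunks:
        collected.append(chunk)
        yield chunk
```

A consumer such as `st.write_stream` would iterate over the generator, and afterwards `"".join(collected)` holds the complete response for the conversation history.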
Assistant.show_conversation_history
Assistant.show_conversation_history (show_function_calls:bool=False)
*Display the conversation history.
Args: show_function_calls (bool): Whether to show function calls and returns in the conversation history. Default is False.
Returns: None*
Assistant.clear_conversation_history
Assistant.clear_conversation_history ()
Clear the conversation history.
Assistant.pprint_tools
Assistant.pprint_tools ()
Pretty-print the descriptions of the available tools.
Assistant.get_status
Assistant.get_status ()
Get the status of the assistant initialization.
add_to_class
add_to_class (Class:type)
Register functions as methods in a class that has already been defined.
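A common way to register a function as a method of an existing class is a decorator factory built on `setattr`. The following is a minimal sketch of that pattern, assuming `add_to_class` is used as a decorator; the name `sketch_add_to_class` and the `Counter` class are made up for the example.

```python
from typing import Callable


def sketch_add_to_class(Class: type) -> Callable:
    """Return a decorator that attaches the decorated function to Class as a method."""
    def wrapper(func: Callable) -> Callable:
        setattr(Class, func.__name__, func)
        return func
    return wrapper


class Counter:
    def __init__(self) -> None:
        self.value = 0


@sketch_add_to_class(Counter)
def increment(self, step: int = 1) -> int:
    """Increase the counter and return the new value."""
    self.value += step
    return self.value
```

This pattern is convenient in notebooks, where a class can be defined in one cell and its methods added and documented in later cells.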