...
Development Roadmap
Next steps:
Q1-Q2 2024
...
Going from a proof-of-concept to a production-ready platform:
Proposed way forward:
Refinement of Retrieval Mechanism: Enhance the AI's ability to accurately source and retrieve relevant data.
Revisit Architecture
Evaluate storage backends (vector store vs. knowledge graph)
Set up Semantic Retrieval
Enrichment of Sources: Expand and diversify the data sources to improve the tool's comprehensiveness.
Fine-Tuning for Quality: Optimize the AI algorithms to ensure high-quality, accurate compliance term generation.
...
Set up RAG-Triad evaluation with a set of evaluation questions
Iterate with:
Re-ranking,
Sentence-window retrieval, and
Auto-merging retrieval
Self-evaluation by LLM
Q1 2024 Internal Beta Roll-out:
POC to Production in AWS: Transition the Proof of Concept (POC) into a full-scale production environment on AWS.
Integration with Mapper Team: Provide the tool to the mapper team for human validation, ensuring accuracy and reliability.
Q2/Q3 2024 External Beta Roll-out:
Following successful internal use and quality benchmarks, release the tool externally as a feature of the Dictionary solution.
...
Financials
The AI-based Term Creation tool is projected to be cost-neutral, delivering significant internal cost savings and serving as a valuable solution for customers.
Cost Savings
...
(validated)
UCF Mapping Team Efficiency: The team has created 1,351 new terms year-to-date, spending 20-30 minutes per term, totaling approximately 563 hours.
Annual Cost Savings: At an internal cost of $60/hour, this corresponds to annual savings of approximately $33,775.
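The savings figure can be reproduced directly from the stated inputs (1,351 terms, 20-30 minutes per term, $60/hour); a quick sanity check, using the midpoint of the time range:

```python
# Sanity check of the cost-savings figures; 25 min is the midpoint
# of the stated 20-30 minute range.
terms_created = 1_351
minutes_per_term = 25
internal_rate_per_hour = 60  # USD

total_hours = terms_created * minutes_per_term / 60
annual_savings = total_hours * internal_rate_per_hour

print(f"Total hours: {total_hours:.0f}")          # ~563 hours
print(f"Annual savings: ${annual_savings:,.0f}")  # $33,775
```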
Revenue Projections
...
Pricing Strategy: The tool will be priced at $5/user/month, with additional charges for token credits in case of overuse.
Market Penetration Assumptions:
Existing Customer Base Penetration: Estimated at 50%, with an average of 2 users per account.
Annual Recurring Revenue (ARR): Potential ARR is projected to be $240,000
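Working backwards from the stated assumptions ($5/user/month, 50% penetration, 2 users per account), the $240,000 ARR implies an existing base of roughly 4,000 accounts; a back-of-the-envelope check (the account count is inferred from the ARR, not given in the document):

```python
# Back-of-the-envelope ARR check; the 4,000-account base is inferred
# from the stated ARR figure, not given explicitly.
accounts = 4_000          # assumed existing customer base
penetration = 0.50        # share of accounts adopting the tool
users_per_account = 2
price_per_user_month = 5  # USD

paying_users = accounts * penetration * users_per_account
arr = paying_users * price_per_user_month * 12
print(f"ARR: ${arr:,.0f}")  # $240,000
```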
...
As sources for its compliance knowledge, the POC could ingest both PDF documents and websites.
Detail on Development Roadmap
Refinement of Retrieval Mechanism
The main evaluation parameters for RAG:
...
Precision
RAG performance hinges on retrieving context that is precisely related to the query.
Hallucinations
RAG is an effective way to combat ‘hallucinations’ of an LLM. However, even with RAG, problems can occur.
When the LLM finds no relevant information in the retrieved context, it still tries to produce an answer and falls back on pre-existing knowledge from its pre-training phase.
So when context relevance is low, LLMs tend to ‘fill in’ the gaps with ‘general’ knowledge from training. Such an answer has low groundedness, even though it may seem plausible and contextually relevant.
Precision: Some underlying problems:
Chunk size
...
Fragmentation
After vector retrieval, we feed a set of fragmented chunks of information into the LLM’s context window; the smaller the chunk size, the worse the fragmentation.
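To illustrate the trade-off, the sketch below (plain Python, no retrieval library assumed) splits a passage into fixed-size character chunks; at small chunk sizes a single fact ends up scattered across fragments, none of which is self-contained:

```python
def chunk(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character chunks with optional overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

passage = ("Controls must be reviewed annually. The review covers access "
           "rights, encryption settings and audit-log retention.")

for size in (120, 40):
    pieces = chunk(passage, size)
    print(f"size={size}: {len(pieces)} chunk(s)")
    # At size=40 the sentence describing the review is split across
    # several chunks, so no single chunk carries the full fact.
```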
...
Evaluation
Additional evaluation parameters:
...
Solution approaches
Precision
Reranking
After the first set of results obtained with dense retrieval, we apply an additional sorting pass that re-scores each candidate against the query (typically with a cross-encoder).
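A minimal sketch of the idea; the token-overlap scorer is a toy stand-in for a real relevance model (in practice a cross-encoder, e.g. from sentence-transformers, would supply the scores):

```python
def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Re-sort first-pass retrieval results with a second scoring step.

    The token-overlap score below is a stand-in for a cross-encoder
    that scores each (query, candidate) pair jointly.
    """
    q_tokens = set(query.lower().split())

    def score(doc: str) -> float:
        d_tokens = set(doc.lower().split())
        return len(q_tokens & d_tokens) / max(len(d_tokens), 1)

    return sorted(candidates, key=score, reverse=True)[:top_k]

# First-pass dense retrieval might return these in embedding-similarity order:
hits = [
    "Quarterly budget report for the finance team",
    "Encryption controls must be reviewed annually",
    "Annual review of access controls and encryption",
]
print(rerank("annual encryption controls review", hits, top_k=2))
```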
Sentence-window retrieval
Retrieve not only the sentence found in embedding lookup but also the sentence before and after.
...
LLMs work better with larger chunks, but vector-based retrieval delivers smaller chunks.
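Sentence-window retrieval addresses this mismatch: search over small units, but hand the LLM a larger window. A minimal sketch, assuming sentences are pre-split and the embedding lookup returns the index of the best-matching sentence:

```python
def sentence_window(sentences: list[str], hit_index: int, window: int = 1) -> str:
    """Expand a single retrieved sentence to include its neighbours."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

sentences = [
    "Section 4 covers encryption.",
    "Keys must be rotated every 90 days.",
    "Rotation events are logged.",
    "Logs are retained for one year.",
]
# The embedding lookup matched sentence 1; the LLM receives the window.
print(sentence_window(sentences, hit_index=1))
```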
...
After embedding and before sending the chunks to the LLM, we re-rank the chunks.
Auto-merging retrieval
We create a hierarchy of larger parent nodes with smaller children nodes.
When enough of a parent’s children appear in the embedding lookup results (i.e., a threshold is exceeded), the child nodes are merged into, and replaced by, their parent node.
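A minimal sketch, assuming each child chunk knows its parent and merging happens when more than half of a parent's children are retrieved (the 50% threshold is an illustrative choice):

```python
from collections import defaultdict

def auto_merge(retrieved: list[str],
               parent_of: dict[str, str],
               children_of: dict[str, list[str]],
               threshold: float = 0.5) -> list[str]:
    """Replace retrieved child chunks by their parent chunk when the
    retrieved fraction of that parent's children exceeds the threshold."""
    hits_per_parent = defaultdict(list)
    for chunk in retrieved:
        hits_per_parent[parent_of[chunk]].append(chunk)

    result = []
    for parent, hits in hits_per_parent.items():
        if len(hits) / len(children_of[parent]) > threshold:
            result.append(parent)   # merged: return the larger parent node
        else:
            result.extend(hits)     # keep the individual child chunks
    return result

children_of = {"P1": ["c1", "c2", "c3"], "P2": ["c4", "c5"]}
parent_of = {c: p for p, kids in children_of.items() for c in kids}
# c1 and c2 cover 2/3 of P1 (merge); c4 covers only 1/2 of P2 (keep).
print(auto_merge(["c1", "c2", "c4"], parent_of, children_of))
```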
...
“Self-evaluation” by LLM
...
Examples of LLM evaluations
...
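As one illustrative sketch (the prompt wording and 0-1 scale are assumptions, not a specific framework's templates), the RAG-Triad judgments can be posed as three judge prompts sent to an evaluator LLM:

```python
# Illustrative judge prompts for the RAG Triad; wording and scale are
# assumptions for the sketch, not taken from a specific framework.
TRIAD = {
    "context_relevance": "Rate 0-1 how relevant the CONTEXT is to the QUESTION.",
    "groundedness": "Rate 0-1 how well the ANSWER is supported by the CONTEXT.",
    "answer_relevance": "Rate 0-1 how well the ANSWER addresses the QUESTION.",
}

def triad_prompts(question: str, context: str, answer: str) -> dict[str, str]:
    """Build one judge prompt per RAG-Triad dimension."""
    body = f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
    return {name: f"{instruction}\n{body}" for name, instruction in TRIAD.items()}

prompts = triad_prompts(
    question="How often must keys be rotated?",
    context="Keys must be rotated every 90 days.",
    answer="Every 90 days.",
)
print(sorted(prompts))  # the three triad dimensions
```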
Result exploration
Evaluation results can be visualized; below is an example of how this can be done with e.g. TruLens.
...
Evaluation allows drilling down into individual results
...
And show the feedback from the LLM on the results. This provides insight into how changing parameters influences results.
...
Potential metrics
Accuracy Rate:
Definition Correctness: Percentage of terms where the generated definition accurately reflects the intended meaning.
Reference Relevance: Proportion of contextually relevant references to the generated terms.
Error Rate:
Misinterpretation Frequency: Track the frequency of incorrect interpretations or irrelevant definitions generated.
Inconsistency Detection: Measure instances where the tool provides varying quality across similar requests.
Response Time:
Generation Speed: Monitor the average time to generate a term and its definition, ensuring it meets efficiency standards.
Usage Metrics:
Adoption Rate: Track the number of active users and frequency of use, indicating the tool's perceived value.
Repeat Usage: Measure how often users return to the tool, indicating reliance and satisfaction.
Benchmarking:
Comparison with Manual Processes: Compare the quality of terms generated by the tool against those created manually.
Competitor Comparison: Regularly compare the tool's output quality against similar offerings in the market.