PRD - Federated Dictionary
Overview
Background
Online dictionaries such as Merriam-Webster, Oxford English Dictionary, and Cambridge Dictionary have become commonplace for writers and readers who rely on dictionaries to get a clear meaning of words they come across or intend to use. Specialists such as lexicographers and glossarists rely on these dictionaries to research existing terminological entries and may use many different online dictionaries while conducting their research. Many of these dictionaries also expose APIs, but the API definitions are disparate and no central hub connects the dictionaries into a federated system.
Unified Compliance has its own Compliance Dictionary which has been built up over years primarily focused on terms (mostly nouns and verbs) found in authority documents. These terms are used during the tagging process (automated with NLP and manual) to help break down authority documents into citations then into actionable mandates which can then be mapped to common controls.
Objective
To help fulfill Unified Compliance’s mission statement to be THE authoritative platform for regulatory content, control frameworks, and policy guidance, one of the initiatives is to build products and functionality that extend the usage beyond professional lexicographers and glossarists to include many additional users to allow them to contribute to UC’s content repository.
To encourage and facilitate a wider set of users utilizing UC functionality, the objective of the Federated Dictionary is to create a single-point federated research dictionary that leverages many dictionary APIs together, allowing researchers (professional and non-professional alike) to more quickly and easily research and contribute terms to a common dictionary. By increasing the content in the Federated Dictionary, we will make it substantially easier for other documents to be ingested (programmatically or manually) into the UCF.
The Federated Dictionary and related APIs will need to support a workflow including expert validation tasks to ensure the dictionary entries are appropriately defined. However, the workflows must be made simple and efficient to expedite content contribution.
Personas
Note: personas are in the process of being built. Once they are in place, links to those personas will be provided.
UC Mapper (linguist, English major, attention to detail …)
UC Mapper Approver (linguist, subject matter experts, …)
Partner Contributor (expert in their software and processes, not a linguist, busy with other tasks …)
Customer Contributor (institutional organization knowledge, domain expert, not a linguist, likely minimal compliance knowledge …)
Success Metrics
List project goals and the metrics we’ll use to judge success
Goal | Metric |
---|---|
Expand the automatic and manual tagging process to use dictionaries beyond UC’s Compliance Dictionary | Any document, well beyond those produced by standards bodies like PCI-DSS and NIST, can easily be tagged (both automated and manual) utilizing terms found in third-party dictionaries |
Expand dictionary usage and term population beyond the UC Mapper team | Substantially increase the number of users outside of Unified Compliance contributing terms to the Federated Dictionary |
Current Challenges / Limitations
Time Consuming: The UC Compliance Dictionary includes terms (nouns and verbs) including multiword expression terms that have been built up over time. However, even with this expansive dictionary, on average, 24 new terms are required to be researched and defined each time a new Authority Document is added to the UC’s content repository. Researching, defining, and adding new terms to the Compliance Dictionary takes time and effort, lengthening the tagging and mapping process.
An example of how a researcher would perform their research is as follows:
The researcher loads a new tab for the dictionary of choice and performs a search. If there is a response that the researcher is satisfied with, and the researcher doesn’t wish to search further, the researcher then can cite the selected response including the term-definition pair, citation, and attribution. However, if the researcher isn’t satisfied with the response (or there is no response at all), the researcher will then navigate to another dictionary site and perform the same search again. If the researcher receives multiple responses from different sites, the researcher will switch tabs or open them side-by-side in order to compare results.
Limited in Scope: Since the Compliance Dictionary was created with terms primarily related to Authority Documents, it is limited in scope. Documents like an organization’s policies and procedures, or VM security setup guidelines, will contain many more terms not defined in the Compliance Dictionary, making it even more time consuming for these documents to be tagged and mapped.
Lack of Consolidated View: Despite the proliferation of online dictionaries including Merriam-Webster, Oxford English Dictionary, Cambridge Dictionary and others, and their respective APIs, there is no consolidated / federated interface allowing researchers to search across all dictionaries and compare results when conducting term, definition, and thesaurus research. This makes it time consuming and tedious to open multiple browser tabs or windows and perform identical searches on multiple sites.
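The consolidated search described above could be sketched as a fan-out over per-dictionary adapters. This is a minimal illustration only, not a proposed design: the `Definition` fields, the `Provider` type, and the stub providers are assumptions made for the example, and no real dictionary API is called.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Definition:
    source: str      # name of the dictionary that returned the result
    term: str
    definition: str
    citation: str    # attribution text supplied by the source


# Hypothetical adapter type: any callable that takes a term and returns definitions.
Provider = Callable[[str], List[Definition]]


def federated_search(term: str, providers: Dict[str, Provider]) -> Dict[str, List[Definition]]:
    """Query every provider in parallel and group the results by source name."""
    results: Dict[str, List[Definition]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(providers))) as pool:
        futures = {name: pool.submit(fn, term) for name, fn in providers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                # One unavailable source should not block the other results.
                results[name] = []
    return results


# Stub providers standing in for real dictionary adapters (illustrative only).
def stub_a(term: str) -> List[Definition]:
    return [Definition("stub-a", term, "first stub definition", "Stub A")]


def stub_b(term: str) -> List[Definition]:
    return []  # this source has no entry for the term


def stub_broken(term: str) -> List[Definition]:
    raise RuntimeError("source unavailable")
```

Grouping results by source keeps the attribution visible, so the researcher can compare responses side by side without switching tabs.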
Benefits
Allowing content contributors (customers, partners, or UC) to have easy access to third party dictionaries will make it substantially faster for contributors to research terms, add terms to the Federated Dictionary, and tag documents.
As the Federated Dictionary grows in content, the automatic tagging process will increase the hit rate of matched terms, reducing the overall time for research including the dictionary update process.
As the Federated Dictionary grows in content, many other documents outside of UC’s domain expertise and organization-specific content (e.g., policies and procedures) can quickly be tagged and mapped into the UCF.
As the UC Federated Knowledge Graph grows in content and users (persons, organizations, and automated systems), the potential network economics and value increase, provided that critical mass (market tipping) and monetization occur while security and trust are maintained. ("Network effect." Wikipedia, Wikimedia Foundation, 19 Nov. 2022. Accessed 23 Dec. 2022.)
Subscription / Pricing / Billing Impacts
Accessing the third-party dictionaries will incur a nominal cost. This incremental cost must be passed along to customers.
There is a potential for licensing the Federated Dictionary as a standalone offering if it satisfies an underserved need for researchers or other users by reducing research time, includes additional content such as multi-part words, better representation of relationships, and more.
Beta and Early Access
Alpha and Beta access must be made available to selected members of the UC mapping team as they are the expert end-users of the Authority Document research, tagging, and mapping process.
To prove or disprove the hypothesis that the Federated Dictionary has value outside of Authority Document tagging and research, early access should be given to a wider set of target users and use cases.
Risks and Assumptions
There must be an approval / validation process to ensure term definitions are done properly. However, there must be options to make the workflow easy and fast. Note that Wikipedia has opined on different options for an approval mechanism, as called out in the references below. As they mention, the process must not be so rigorous that experts take too long and slow it down, which is what led to Nupedia's failure.
There could potentially be multiple contributors wanting to contribute overlapping and contradictory term definitions. We need to put in place mechanisms to allow for multiple definitions (as is done within online dictionaries) and potentially include a ranking or rating mechanism.
One of the downsides of a federated approach is that historical data is stored only in the source systems, not locally in the federated data store, because the federated data store typically holds metadata with a reference to the source data and not the data itself. If any change is made in a third-party dictionary, the source dictionary has the responsibility of storing the history. If we store only metadata references, the Federated Dictionary will be unaware of both the history and the change. If we store more than metadata, including the current definition, we will need to put in place an update/sync mechanism and determine whether to keep history or not.
As is the case today with the Compliance Dictionary, it is assumed that term-definition pairs must be saved/persisted into a common/global dictionary (whether this is the Compliance Dictionary or a new federated dictionary) with reference to their sources, which will be used for NLP and tagging.
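If definitions are stored locally alongside the source reference, one possible way to notice that a third-party definition has changed is to keep a fingerprint of the text at the time it was cited and compare it during a later sync. The sketch below assumes this approach; the `StoredReference` fields and function names are hypothetical and the choice of SHA-256 over normalized text is an illustration, not a decision.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredReference:
    source: str        # third-party dictionary the definition was cited from
    term: str
    definition: str    # definition text as it read when it was cited
    fingerprint: str   # hash of the normalized definition text


def fingerprint(definition: str) -> str:
    """Hash the definition after collapsing whitespace and lowercasing."""
    normalized = " ".join(definition.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def snapshot(source: str, term: str, definition: str) -> StoredReference:
    """Record the definition together with its fingerprint at citation time."""
    return StoredReference(source, term, definition, fingerprint(definition))


def has_drifted(stored: StoredReference, current_definition: str) -> bool:
    """Return True when the source's current text no longer matches the snapshot."""
    return stored.fingerprint != fingerprint(current_definition)
```

Normalizing before hashing means cosmetic changes (spacing, capitalization) are not reported as drift; whether that is the desired behavior is an open design choice.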
Milestones and Phases
List the project milestones along with how that milestone can be successfully measured.
Number | Description | Success Measurement |
---|---|---|
| Common schema defined and mapped to the five (5) identified dictionary partners of: | |
| … | |
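To illustrate what mapping partner responses to a common schema might involve, here is a small sketch. Everything in it is an assumption made for illustration: the `CommonEntry` fields, the partner names, and both payload shapes are invented and do not describe any real dictionary API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class CommonEntry:
    term: str
    definitions: List[str]
    synonyms: List[str]
    source: str


def from_partner_a(payload: Dict[str, Any]) -> CommonEntry:
    # Invented shape: {"word": ..., "senses": [{"text": ...}], "syns": [...]}
    return CommonEntry(
        term=payload["word"],
        definitions=[sense["text"] for sense in payload.get("senses", [])],
        synonyms=list(payload.get("syns", [])),
        source="partner-a",
    )


def from_partner_b(payload: Dict[str, Any]) -> CommonEntry:
    # Invented shape: {"headword": ..., "meaning": ..., "related": {"synonyms": [...]}}
    related = payload.get("related", {})
    return CommonEntry(
        term=payload["headword"],
        definitions=[payload["meaning"]],
        synonyms=list(related.get("synonyms", [])),
        source="partner-b",
    )


# One mapper per partner; supporting a new partner means adding one function here.
MAPPERS: Dict[str, Callable[[Dict[str, Any]], CommonEntry]] = {
    "partner-a": from_partner_a,
    "partner-b": from_partner_b,
}


def normalize(source: str, payload: Dict[str, Any]) -> CommonEntry:
    """Convert a partner-specific payload into the common schema."""
    return MAPPERS[source](payload)
```

A success measurement for this milestone could then be phrased in terms of every identified partner having a mapper that produces a complete common entry.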
Product Requirements
Use Cases
Happy Day Scenarios
As a researcher, I can easily search for terms and visually see results from multiple dictionaries. I can optionally pick the source definition that most accurately relates to my research and quickly and easily save the reference in the Federated Dictionary for follow-on work.
As a content contributor during the mapping process, I can easily identify a term that was not stored in the Federated Dictionary. I am presented with options from multiple third-party dictionaries, where I pick the best one that relates to the document I am mapping. That term is automatically added to the Federated Dictionary with the term-definition pair, reference to source, synonyms and all other pertinent information.
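The save step in these scenarios could look roughly like the sketch below, which stores a selected definition with its source and refuses an exact repeat of the same term from the same source. The class, its in-memory storage, and the normalization rule are assumptions for illustration; the real store and duplicate rules are still to be defined.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entry:
    term: str
    definition: str
    source: str   # third-party dictionary the definition was taken from


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case so 'Access  Control' matches 'access control'."""
    return " ".join(text.split()).casefold()


class FederatedDictionary:
    """In-memory stand-in for the common dictionary (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], Entry] = {}

    def lookup(self, term: str) -> List[Entry]:
        """Return every stored definition of the term, one per source."""
        wanted = _normalize(term)
        return [entry for (t, _), entry in self._entries.items() if t == wanted]

    def save(self, entry: Entry) -> bool:
        """Store the entry; return False if this term is already cited from this source."""
        key = (_normalize(entry.term), _normalize(entry.source))
        if key in self._entries:
            return False
        self._entries[key] = entry
        return True
```

Keying on term plus source allows several definitions of one term from different dictionaries while still catching the simplest kind of duplicate.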
Rainy Day Scenarios
As a researcher, I see there is an existing term definition in the Federated Dictionary, but that reference and definition do not accurately match the definition for my research. I need to use a different source and want that source referenced in the Federated Dictionary.
As a researcher, I tagged a document a while back with reference to a term-definition pair where that definition came from one of the third-party dictionaries, but that third-party dictionary has since updated the definition, which no longer fits the cited source document.
Requirements
Requirement | User Story | Importance | Jira Issue | Comments |
---|---|---|---|---|
Dictionary Search | | | | |
Ability for researchers to search for terms using a word, phrase, ID, acronym, or definition | As a researcher I want to search for existing Dictionary Terms so that I can help understand the documents I am reviewing | p1 | | |
Ability to perform an advanced search | | p2 | | |
Ability for the dictionary to suggest terms with autocomplete as the researcher enters a term | As a researcher I want the dictionary search to propose dictionary terms as I am typing to expedite the search process | p1 | | |
Ability for researchers to view details of an existing term’s dictionary-related details | As a researcher I want to view the details of an existing dictionary term found in the common dictionary so that I can further my research | p1 | | Must include details like one sees today in the Compliance Dictionary including acronyms, preferred term, non-standard terms, definitions (type, definition and source citation), other forms (plural, plural possessive, possessive), and relationships |
Ability for researchers to view details of third-party dictionary results in addition to existing dictionary related results | As a researcher I want to view the details of third-party dictionary terms so that I can quickly perform my research without having to jump from dictionary to dictionary | p1 | | There should be an option for the researcher to display third-party results in addition to the common dictionary or display those results when no common dictionary term is found |
Ability for researchers to view details of an existing term’s mapping-related details | As a researcher I want to view the details of an existing dictionary term found in the common dictionary and how it relates to existing documents so that I can further my research | p1 | | Must include details like one sees today in the Compliance Dictionary including related common controls (tagged and not-tagged terms), related citations (tagged and not-tagged terms) |
Dictionary Term Request | | | | |
Ability for researchers to submit a Term Request for a new term based on an integrated cited third-party dictionary source | As a content contributor I want to submit a request for a new Term with reference to a third-party dictionary that I want added to the dictionary so that my team can reference that term | p1 | | Who should be able to add terms from other dictionaries? Will need some governance |
Ability for researchers to submit a Term Request for a new term without citing an integrated dictionary source | As a content contributor I want to submit a request for a new Term without reference to a third-party dictionary that I want added to the dictionary so that my team can reference that term | p1 | | These could be multi-part words not defined in any of the dictionaries or other newer terms not yet defined. Will need some governance |
Ability for researchers to search for existing Term Requests | As a content contributor I want to search for existing Term Requests so that I am aware of its status and can take further actions | p1 | | |
Ability for researchers to view, edit, delete, and withdraw existing Term Requests | As a content contributor I want to view, edit, delete, and withdraw existing requests so that I can manage the queue of work I want done | p1 | | |
Ability for reviewer(s) to validate and approve new terms, updates to existing term, and newly cited terms | As a reviewer I want to review and approve the Term Request details to ensure it has all the pertinent information | p1 | | Similar to what is defined in the Term Request process, need ability to set 0 to N number of reviewers in the business rules |
Ability to check for duplicates | As a content contributor I want the system to automatically check for duplicates to ensure the Dictionary isn’t populated with overlapping redundant data | p1 | | |
Ability to submit a Term Request for a new term, update existing, or newly cited term as private | As a content contributor I want the Term to be visible only to users in my organization so that no other organizations are able to access our private and/or proprietary content | p2 | | |
Ability to submit an updated term definition and keep the version and history | As a content contributor I want to submit a subsequent version of the Term to go through the Term Request process so that any changes can be tracked | p2 | | Need to discuss this one since if terms have relationships with Citations or Common Controls, changing the definition could weaken the reference |
Review Workflow | | | | |
Ability for account administrators to setup business rules for validation of a term | As an account administrator I want to manage the submission and review workflow to best match the structure of my team so that we can submit terms as efficiently as possible | p1 | | For some contributors, they might not want any review since the person submitting the term is the domain expert. Other accounts may want one level of approval |
Ability to setup task owners as a person, role or group | As an account administrator I want to flexibly manage the assignment of tasks in the term definition workflow process to control the queue of work so that I can best manage the workload of my team | p1 | | Above requirement was for the number of steps. This one is for who is assigned to those steps. Indirect assignments like roles and groups makes it much easier to maintain and quickens onboarding / offboarding efforts. Large clients need to have many people |
Ability to submit unvalidated Term Requests | As a content contributor I want to contribute terms that are automatically approved with no need for validation so that my terms can be quickly used for tagging | p1 | | These terms must be marked as “unvalidated” to help end-users know that those terms should be used with caution |
Contribution Management and API Protection | | | | Most/all of the Contribution Management requirements are intended to be used as a guideline for all other objects that we open up to contributors including authority documents and dictionary terms. These rules must be enforced within APIs to protect the content quality |
Ability for UC to revoke or allow organizations to contribute dictionary terms | As a UC Administrator I want to allow or revoke the ability of an organization to contribute so that UC can control who can contribute. The API rules engine must check this setting to allow contribution | | | The initial gating factor for contribution. |
Ability for UC to revoke or allow contributors to validate their own dictionary terms | As a UC Administrator I want to allow or revoke the ability of a contributor to verify their own Terms without UC intervention so that UC can protect the UCF content quality while allowing others to contribute. The API rules engine must check this setting to allow validation of own content | | | Validation permissions can be set at an organization, group, role or person level. Initially the organization level will suffice. If not allowed, other designated Organizations (starting with UC) must perform the validation. Later releases can then allow other Organizations to be designated as validators by UC. Changes must go into effect immediately and retroactively to ensure that any content that is currently in process will follow the updated rule. How can we detect collusion? |
Ability for UC to designate an organization as a certified validator | As a UC Administrator I want to designate whether an organization is a certified validating entity so that we can control validation and inform users of the validation certification. The API rules engine must check this setting to determine whether the organization is certified or not | | | This will help to curtail collusion by only allowing certified organizations to validate content created by others. As a subsequent step we could limit the areas of validation to geography, subject matter, and/or industry. |
Ability for UC to set validation indicator for contributors who are allowed to validate their own Terms | As a UC Administrator I want to inform users that content was contributed without formal validation by a certified validator so that users can make an informed choice to use unvetted content or not. The API rules engine must set the certification indicator (or not) based upon the organization’s certification status | | | If we allow contributors to validate their own content, UC can decide whether or not to designate that the contributed content has been validated by a certified validator. This will allow contributors to validate their own content and allow UC one additional instrument to inform others. UC can require a certification process which must be periodically completed to allow for certification. |
Ability for UC to set and enforce minimal number of validation workflow steps | As a UC Administrator I want to determine the minimal number of validation steps so that UC can increase the likelihood of quality content. The API rules engine must check this setting to determine and enforce number of validation steps | | | When set to 0, an API call must still be made to designate the Term has been validated and can be made using the same person / auth key. Otherwise, the Term Request stays in an unvalidated state. When set to 1 or more, an approval workflow must be in place, otherwise content cannot be added. The approval workflow will designate the group, role, or person. Each person / auth key must be different for each step in the approval flow, else the validation cannot be performed. If the organization is not allowed to perform their own validation, a final validation step will be required as mentioned in the prior requirement |
Term Requests can only be updated or deleted by the contributor owner (person or organization) or UC | The API Rules Engine must ensure contributors only update their own content so that content updates can be controlled | | | Ensures contributors only update their own content |
Term Request owners can be changed by UC | As a UC Administrator I want to change the owner of a Term Request so that it can be further processed | | | If a person or organization no longer exists or is determined to be a “bad actor”, there needs to be a way to change the owner to someone else so that subsequent changes can be made |
Term Requests cannot be deleted once in a validated state | The API Rules Engine must ensure contributors do not delete Term Requests after they have been validated to ensure content deletions are controlled | | | Requests can be made to UC to delete content on the contributor's behalf |
Term Requests cannot be updated once in a validated state | The API Rules Engine must ensure contributors do not update Term Requests after they have been validated to ensure content changes are controlled | | | A new version must be created with reference to the prior version |
A new version of a Term Request can be created only when the Term Request is in a validated state | The API Rules Engine must allow contributors to create a new version of a Term Request only when it is in validated state so that content isn’t | | | |
Ability for UC to subsequently invalidate contributed Term Requests | As a UC Administrator I want to change the validation status to invalid for existing validated Term Requests so that I can ensure the quality of the content | | | UC may determine that some Term Requests should not have been validated (poor quality, duplicates …) and can subsequently invalidate them. |
Limit number of Term Request API calls per type per period | As the API Rules Engine I want to ensure that Term Requests are not manipulated so quickly as to cause performance issues, and to block contributors from creating chaos | | | |
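Several of the Contribution Management rules above describe state-dependent checks that an API rules engine would enforce. The following is a minimal sketch of how those checks might fit together, not a specification: the state names, the `UC` actor constant, and the function names are hypothetical, and treating the owner as ineligible to approve when one or more steps are required is an assumption the requirements do not state explicitly.

```python
from dataclasses import dataclass, field
from typing import List, Optional

UC = "UC"  # hypothetical identifier for Unified Compliance acting as administrator


class RuleViolation(Exception):
    """Raised when an API call would break a contribution rule."""


@dataclass
class TermRequest:
    term: str
    definition: str
    owner: str
    min_validation_steps: int = 0
    version: int = 1
    previous_version: Optional[int] = None
    approvers: List[str] = field(default_factory=list)
    state: str = "unvalidated"  # "unvalidated", "validated", or "invalid"


def _require_owner_or_uc(req: TermRequest, actor: str) -> None:
    if actor not in (req.owner, UC):
        raise RuleViolation("only the owner or UC may change a Term Request")


def update(req: TermRequest, actor: str, definition: str) -> None:
    _require_owner_or_uc(req, actor)
    if req.state == "validated":
        raise RuleViolation("validated Term Requests cannot be updated; create a new version")
    req.definition = definition


def approve(req: TermRequest, approver: str) -> None:
    """Record one validation step; validate once the required steps are met."""
    if req.state != "unvalidated":
        raise RuleViolation("only unvalidated Term Requests can be approved")
    if req.min_validation_steps > 0:
        # Assumption: with 1+ steps, the owner and earlier approvers cannot approve.
        if approver == req.owner or approver in req.approvers:
            raise RuleViolation("each validation step needs a different person")
    req.approvers.append(approver)
    # With 0 required steps, a single explicit call (even by the owner) validates.
    if len(req.approvers) >= max(1, req.min_validation_steps):
        req.state = "validated"


def new_version(req: TermRequest, actor: str, definition: str) -> TermRequest:
    _require_owner_or_uc(req, actor)
    if req.state != "validated":
        raise RuleViolation("a new version requires a validated Term Request")
    return TermRequest(
        term=req.term,
        definition=definition,
        owner=req.owner,
        min_validation_steps=req.min_validation_steps,
        version=req.version + 1,
        previous_version=req.version,  # keep the reference to the prior version
    )


def invalidate(req: TermRequest, actor: str) -> None:
    if actor != UC:
        raise RuleViolation("only UC may invalidate a Term Request")
    if req.state != "validated":
        raise RuleViolation("only validated Term Requests can be invalidated")
    req.state = "invalid"
```

Deletion, ownership transfer, organization-level permissions, and rate limiting are left out to keep the sketch short; they would follow the same pattern of checking the rule before changing state.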
Open Questions
List any open questions that come to mind throughout the lifecycle of this project
Question | Answer | Date Answered |
---|---|---|
What is needed to be saved in the global/common dictionary and why? | | |
Assuming we save definitions in the global/common dictionary with reference to the sources, how do we handle changes from the source dictionaries? | | |
What is required to be saved locally for the NLP processing? | | |
What is required to be saved locally for manual tagging? | | |
What other services / functionality require terms to be persisted locally and what do they need? | | |
Do we need a new database and schema for the Federated Dictionary, or can we use the existing Compliance Dictionary, and why? | | |
Out of Scope / Future Functionality
History and versions
Impacted Product Components
The automatic NLU tagging process
Reference to Citations and Common Controls
User Interaction and Design
Link to mockups, prototypes, or screenshots related to the requirements.
Process Flow Diagrams
Links to user journeys, process flow, or other diagrams related to the requirements.
Guides
If there are UI components to this requirement, list the main areas where interactive user guides would be needed.
Additional References
Proposed Architecture for Federated Dictionary
Article validation - Meta (wikimedia.org)
Outdated Wikipedia Proposed Approval Mechanism
Competitors/Partners
Name | domain | alexaUsRank | alexaGlobalRank | trafficRank | Employees | employeesRange | marketCap | annualRevenue | estimatedAnnualRevenue |
---|---|---|---|---|---|---|---|---|---|
Merriam-Webster | merriam-webster.com | 296 | 535 | very_high | 90 | 51-250 | null | null | $10M-$50M |
| wikipedia.org | | | | | | | | |
| wiktionary.org | | | | | | | | |
| oed.com | | | | | | | | |
| dictionary.com | | | | | | | | |
| dictionary.cambridge.org | | | | | | | | |
| wordnik.com | | | | | | | | |
| britannica.com | | | | | | | | |
| collinsdictionary.com | | | | | | | | |
| macmillandictionary.com | | | | | | | | |
| Pearsons.com | | | | | | | | |
| thefreedictionary.com | | | | | | | | |
| vocabulary.com | | | | | | | | |
"https://company-stream.clearbit.com/v2/companies/find?domain=merriam-webster.com"
{
"id": "43c2daf2-402e-4e30-9bbc-e765f6bd2ba9",
"name": "Merriam-Webster",
"legalName": null,
"domain": "merriam-webster.com",
"domainAliases": [
"merriam.com",
"word.com",
"becomingbankable.com",
"m-w.com",
"myspellit.com",
"webster.com",
"wordcentral.com",
"merriam-webster.biz",
"merriam-webster.info",
"meriamwebster.com",
"wordfind.net",
"merriamwebster.com",
"cdn-mw.com",
"learnerdictionary.com",
"marianwebster.com",
"webster-mobile.com",
"websterunabridged.com",
"webstersthird.com",
"merriam-websterunabridged.com",
"merriam-webstersunabridged.com",
"merriam-websterunabridged.net",
"merriam-webster.net",
"merriamwebster.net",
"meriam-webster.com",
"m-w.org",
"mirriamwebster.com",
"merriamwebster.org",
"m-w.info",
"merriam-webster.org",
"miriamwebster.com",
"spellcheck.com",
"learnersdictionary.biz",
"websters.info",
"learnersdictionary.info",
"m-wu.com",
"mobile-webster.com",
"merriam-websterunabridged.biz",
"merriam-websterunabridged.org",
"webstersunabridged.com",
"merriam-websterunabridged.info",
"merriamwebsterunabridged.com",
"merriamwebstersunabridged.com",
"unabridgedpreview.com",
"learnersdictionary.org",
"learnersdictionary.net"
],
"site": {
"phoneNumbers": [
"+1 413-734-3134",
"+1 413-731-5979"
],
"emailAddresses": [
"customerservice@merriam-webster.com",
"privacy@m-w.com",
"dpo@m-w.com",
"GDPR_EURep@m-w.com",
"permissioneditor@merriam-webster.com"
]
},
"category": {
"sector": "Information Technology",
"industryGroup": "Software & Services",
"industry": "Internet Software & Services",
"subIndustry": "Internet",
"sicCode": "27",
"naicsCode": "32"
},
"tags": [
"E-commerce",
"Internet",
"Technology",
"Publishing",
"B2C"
],
"description": "Merriam-Webster, Inc. is an American company that publishes reference books and is especially known for its dictionaries.",
"foundedYear": 1831,
"location": "PO Box 281, Springfield, MA 01102-0281, US",
"timeZone": "America/New_York",
"utcOffset": -5,
"geo": {
"streetNumber": "281",
"streetName": "PO Box",
"subPremise": null,
"streetAddress": "281 PO Box",
"city": "Springfield",
"postalCode": "01102",
"state": "Massachusetts",
"stateCode": "MA",
"country": "United States",
"countryCode": "US",
"lat": 42.17073,
"lng": -72.60484
},
"logo": "https://logo.clearbit.com/merriam-webster.com",
"facebook": {
"handle": "merriamwebster",
"likes": 357987
},
"linkedin": {
"handle": "company/merriam-webster-inc-"
},
"twitter": {
"handle": "MerriamWebster",
"id": "97040343",
"bio": "Word of the Day, facts and observations on language, lookup trends, and wordplay from the editors at Merriam-Webster Dictionary.",
"followers": 999953,
"following": 689,
"location": "Springfield, MA",
"site": "https://t.co/ezW3fH0kGo",
"avatar": "https://pbs.twimg.com/profile_images/677210982616195072/DWj4oUuT_normal.png"
},
"crunchbase": {
"handle": "organization/merriam-webster"
},
"emailProvider": false,
"type": "private",
"ticker": null,
"identifiers": {
"usEIN": null
},
"phone": null,
"metrics": {
"alexaUsRank": 296,
"alexaGlobalRank": 535,
"trafficRank": "very_high",
"employees": 90,
"employeesRange": "51-250",
"marketCap": null,
"raised": null,
"annualRevenue": null,
"estimatedAnnualRevenue": "$10M-$50M",
"fiscalYearEnd": null
},
"indexedAt": "2022-12-01T07:22:35.526Z",
"tech": [
"google_apps",
"aws_route_53",
"sendgrid",
"nginx",
"google_tag_manager",
"jw_player",
"google_analytics",
"app_nexus",
"media.net",
"marchex",
"appnexus",
"apache_http_server",
"dropbox",
"turn",
"entrust",
"dstillery",
"openx",
"mediamath",
"basecamp",
"the_trade_desk",
"mongodb",
"microsoft_project",
"ibm_cognos",
"pubmatic",
"datadog",
"rubicon_project",
"oracle_peoplesoft",
"sugarcrm",
"google_search_appliance",
"cj_affiliate",
"bluekai",
"acxiom",
"netsuite",
"stackadapt",
"postgresql",
"mysql",
"applepay",
"admeld",
"appier",
"salesforce_dmp",
"aggregate_knowledge",
"atlassian_jira",
"oracle_hyperion",
"iponweb_bidswitch",
"zedo"
],
"techCategories": [
"productivity",
"dns",
"email_delivery_service",
"web_servers",
"tag_management",
"image_video_services",
"analytics",
"advertising",
"marketing_automation",
"data_management",
"security",
"adverstising",
"database",
"monitoring",
"business_management",
"crm",
"payment",
"project_management_software"
],
"parent": {
"domain": null
},
"ultimateParent": {
"domain": null
}
}
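The Merriam-Webster row in the Competitors/Partners table appears to be taken from the `name`, `domain`, and `metrics` fields of the response above. Assuming responses for the other domains have the same shape, a small helper like the following (the function name and column list are illustrative) could produce the remaining rows from saved JSON.

```python
import json
from typing import Any, Dict

# Metric keys in the order of the table columns after Name and domain.
METRIC_COLUMNS = [
    "alexaUsRank",
    "alexaGlobalRank",
    "trafficRank",
    "employees",
    "employeesRange",
    "marketCap",
    "annualRevenue",
    "estimatedAnnualRevenue",
]


def to_table_row(company: Dict[str, Any]) -> str:
    """Format one company record as a pipe-delimited table row."""
    metrics = company.get("metrics") or {}
    cells = [company.get("name"), company.get("domain")]
    cells += [metrics.get(key) for key in METRIC_COLUMNS]
    # Missing or null values are written as "null", matching the existing row.
    return " | ".join("null" if cell is None else str(cell) for cell in cells) + " |"


# Trimmed copy of the response above, keeping only the fields the table uses.
sample = json.loads("""
{
  "name": "Merriam-Webster",
  "domain": "merriam-webster.com",
  "metrics": {
    "alexaUsRank": 296,
    "alexaGlobalRank": 535,
    "trafficRank": "very_high",
    "employees": 90,
    "employeesRange": "51-250",
    "marketCap": null,
    "annualRevenue": null,
    "estimatedAnnualRevenue": "$10M-$50M"
  }
}
""")

print(to_table_row(sample))
# Merriam-Webster | merriam-webster.com | 296 | 535 | very_high | 90 | 51-250 | null | null | $10M-$50M |
```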