PRD - Federated Dictionary

Overview

Background

Online dictionaries such as Merriam-Webster, Oxford English Dictionary, and Cambridge Dictionary have become commonplace for writers and readers who rely on dictionaries to get a clear meaning of words they come across or intend to use. Specialists such as lexicographers and glossarists rely on these dictionaries to research existing terminological entries and may use many different online dictionaries while conducting their research. Many of these dictionaries also offer APIs, but the API definitions are disparate and there is no central hub connecting all dictionaries into a federated system.

Unified Compliance has its own Compliance Dictionary which has been built up over years primarily focused on terms (mostly nouns and verbs) found in authority documents. These terms are used during the tagging process (automated with NLP and manual) to help break down authority documents into citations then into actionable mandates which can then be mapped to common controls.

Objective

To help fulfill Unified Compliance’s mission statement to be THE authoritative platform for regulatory content, control frameworks, and policy guidance, one of the initiatives is to build products and functionality that extend the usage beyond professional lexicographers and glossarists to include many additional users to allow them to contribute to UC’s content repository.

To encourage and facilitate a wider set of users utilizing UC functionality, the objective of the Federated Dictionary is to create a single-point federated research dictionary that leverages many dictionary APIs together, allowing researchers (professional and non-professional alike) to more quickly and easily research and contribute terms to a common dictionary. By increasing the content in the Federated Dictionary, we will make it substantially easier for other documents to be ingested (programmatically or manually) into the UCF.

The Federated Dictionary and related APIs will need to support a workflow including expert validation tasks to ensure the dictionary entries are appropriately defined. However, the workflows must be made simple and efficient to expedite content contribution.

Personas

Note: personas are in the process of being built. Once they are in place, links to those personas will be provided.

  • UC Mapper (linguist, English major, attention to detail …)

  • UC Mapper Approver (linguist, subject matter experts, …)

  • Partner Contributor (expert in their software and processes, not a linguist, busy with other tasks …)

  • Customer Contributor (institutional organization knowledge, domain expert, not a linguist, likely minimal compliance knowledge …)

Success Metrics

List project goals and the metrics we’ll use to judge success

Goal

Metric

Expand the automatic and manual tagging process to use dictionaries beyond UC’s Compliance Dictionary

Any document, well beyond those produced by standards bodies like PCI-DSS and NIST, can easily be tagged (both automated and manual) utilizing terms found in third-party dictionaries

Expand dictionary usage and term population beyond the UC Mapper team

Substantially increase the number of users outside of Unified Compliance contributing terms to the Federated Dictionary

Current Challenges / Limitations

Time Consuming: The UC Compliance Dictionary includes terms (nouns and verbs) including multiword expression terms that have been built up over time. However, even with this expansive dictionary, on average, 24 new terms are required to be researched and defined each time a new Authority Document is added to the UC’s content repository. Researching, defining, and adding new terms to the Compliance Dictionary takes time and effort, lengthening the tagging and mapping process.

An example of how a researcher would perform their research is as follows:

The researcher loads a new tab for the dictionary of choice and performs a search. If there is a response that the researcher is satisfied with, and the researcher doesn’t wish to search further, the researcher then can cite the selected response including the term-definition pair, citation, and attribution. However, if the researcher isn’t satisfied with the response (or there is no response at all), the researcher will then navigate to another dictionary site and perform the same search again. If the researcher receives multiple responses from different sites, the researcher will switch tabs or open them side-by-side in order to compare results.

Limited in Scope: Because the Compliance Dictionary was created with terms primarily related to Authority Documents, it is limited in scope. Documents such as an organization’s policies and procedures, or VM security setup guidelines, will contain many more terms not defined in the Compliance Dictionary, making it even more time consuming to tag and map them.

Lack of Consolidated View: Despite the proliferation of online dictionaries, including Merriam-Webster, Oxford English Dictionary, Cambridge Dictionary, and others, and their respective APIs, there is no consolidated / federated interface that allows researchers to search across all dictionaries and compare results when conducting term, definition, and thesaurus research. Opening multiple browser tabs or windows and performing identical searches on multiple sites is time consuming and tedious.

Benefits

Allowing content contributors (customers, partners, or UC) to have easy access to third party dictionaries will make it substantially faster for contributors to research terms, add terms to the Federated Dictionary, and tag documents.

As the Federated Dictionary grows in content, the automatic tagging process will increase the hit rate of matched terms, reducing the overall time for research including the dictionary update process.

As the Federated Dictionary grows in content, many other documents outside of UC’s domain expertise and organization-specific content (e.g., policies and procedures) can quickly be tagged and mapped into the UCF.

As the UC Federated Knowledge Graph grows in content and users (persons, organizations, and automated systems), its potential network economics and value increase, provided that monetization and market tipping (critical mass) occur while security and trust are maintained. ("Network Effect." Wikipedia, Wikimedia Foundation, 19 Nov. 2022. Accessed 23 Dec. 2022.)

Subscription / Pricing / Billing Impacts

Accessing the third-party dictionaries will include a nominal cost. This incremental cost must be passed along to customers.

There is a potential for licensing the Federated Dictionary as a standalone offering if it satisfies an underserved need for researchers or other users by reducing research time, includes additional content such as multi-part words, better representation of relationships, and more.

Beta and Early Access

Alpha and Beta access must be made available to selected members of the UC mapping team as they are the expert end-users of the Authority Document research, tagging, and mapping process.

To prove or disprove the hypothesis that the Federated Dictionary has value outside of Authority Document tagging and research, early access should be given to a wider set of target users and use cases.

Risks and Assumptions

There must be an approval / validation process to ensure that terms are defined properly. However, there must be options to keep the workflow easy and fast. Note that Wikipedia has considered different options for an approval mechanism, as called out in the references below. As those references mention, the process must not be so rigorous that experts take too long and slow it down, which is what led to Nupedia's failure.

There could potentially be multiple contributors wanting to contribute overlapping and contradictory term definitions. We need to put in place mechanisms to allow for multiple definitions (as is done within online dictionaries) and potentially include a ranking or rating mechanism.
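One way to picture the "multiple definitions plus ranking" idea is the minimal sketch below. It is only an illustration: the `Definition` class, the 1 to 5 rating scale, and the ordering rule (average rating first, then number of ratings) are all assumptions and not decisions made in this PRD.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Definition:
    """One contributed definition of a term (hypothetical structure)."""
    text: str
    contributor: str
    ratings: List[int] = field(default_factory=list)  # assumed 1-5 scale

    @property
    def average_rating(self) -> float:
        # Unrated definitions sort last rather than raising an error.
        return sum(self.ratings) / len(self.ratings) if self.ratings else 0.0


def rank_definitions(definitions: List[Definition]) -> List[Definition]:
    """Order definitions by average rating, then by how many ratings they have."""
    return sorted(
        definitions,
        key=lambda d: (d.average_rating, len(d.ratings)),
        reverse=True,
    )


candidates = [
    Definition("first overlapping definition", "org-a", [5, 4]),
    Definition("second, contradictory definition", "org-b", [3]),
    Definition("third definition, not yet rated", "org-c"),
]
for d in rank_definitions(candidates):
    print(f"{d.contributor}: {d.average_rating:.1f}")
```

All definitions are kept. Ranking only changes the order in which they are presented to the researcher.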

One of the downsides of a federated approach is that historical data is stored only in the source systems, not locally in the federated data store, since the federated data store will typically store metadata with a reference to the source data and not the data itself. If any change is made in the third-party dictionary, the source dictionary has the responsibility of storing the history. However, if we only store metadata references, the Federated Dictionary will be unaware of the history and unaware of any changes. If we store more than just metadata, including the current definition in the dictionary, we will need to put in place an update/sync mechanism and determine whether to keep history or not.

As is the case today with the Compliance Dictionary, it is assumed that term-definition pairs are required to be saved/persisted into a common/global dictionary (whether this is the Compliance Dictionary or a new Federated Dictionary) with reference to sources, which will be used for NLP and tagging.
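If the current definition is stored locally, one possible shape for the update/sync check is a fingerprint comparison, sketched below. This is an assumption-laden illustration, not a chosen design: `StoredReference`, `fingerprint`, and `has_source_changed` are hypothetical names, and how the current source definition would be fetched is left out.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(definition: str) -> str:
    """Hash a definition after collapsing whitespace and case differences."""
    normalized = " ".join(definition.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class StoredReference:
    """What the Federated Dictionary might persist for a cited definition."""
    term: str
    source: str            # which third-party dictionary was cited
    definition: str        # local copy of the definition at citation time
    definition_hash: str   # fingerprint of that copy


def store(term: str, source: str, definition: str) -> StoredReference:
    return StoredReference(term, source, definition, fingerprint(definition))


def has_source_changed(stored: StoredReference, current_definition: str) -> bool:
    """True when the source's current text no longer matches the stored copy."""
    return fingerprint(current_definition) != stored.definition_hash


ref = store("control", "source_a", "A safeguard.")
print(has_source_changed(ref, "a  safeguard."))     # formatting-only difference
print(has_source_changed(ref, "A countermeasure."))  # substantive difference
```

A sync job could run this check periodically and flag changed terms for review. Whether to keep the old text as history remains the open decision described above.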

Milestones and Phases

List the project milestones along with how that milestone can be successfully measured.

Number

Description

Success Measurement

 

Common schema defined and mapped to the five (5) identified dictionary partners of:

  1. HarperCollins

  2. Merriam-Webster

  3. Wordnik

  4. Oxford English Dictionary

  5. Pearson

 

 

 

Product Requirements

Use Cases

Happy Day Scenarios

As a researcher, I can easily search for terms and visually see results from multiple dictionaries. I can optionally pick the source definition that most accurately relates to my research and quickly and easily save the reference in the Federated Dictionary for follow-on work.

As a content contributor during the mapping process, I can easily identify a term that was not stored in the Federated Dictionary. I am presented with options from multiple third-party dictionaries, where I pick the best one that relates to the document I am mapping. That term is automatically added to the Federated Dictionary with the term-definition pair, reference to source, synonyms and all other pertinent information.
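The happy day scenarios above can be sketched as a simple fan-out search. The sketch is hypothetical throughout: `DictionaryEntry`, `SourceAdapter`, `federated_search`, and the stub sources are invented for illustration, and no real dictionary API is called. It assumes each third-party dictionary sits behind an adapter that returns results in one common schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DictionaryEntry:
    """One search result in a common schema, whatever its source."""
    term: str
    definition: str
    source: str    # which dictionary produced the definition
    citation: str  # attribution to persist with the term-definition pair


# An adapter hides one dictionary's API behind a common call signature.
SourceAdapter = Callable[[str], List[DictionaryEntry]]


def federated_search(
    term: str,
    common_lookup: SourceAdapter,
    third_party: Dict[str, SourceAdapter],
    always_include_third_party: bool = False,
) -> List[DictionaryEntry]:
    """Search the common dictionary first, then fan out to third parties."""
    results = list(common_lookup(term))
    if results and not always_include_third_party:
        return results
    for adapter in third_party.values():
        try:
            results.extend(adapter(term))
        except Exception:
            # A failing source should not hide results from the others.
            continue
    return results


def make_stub_source(name: str, data: Dict[str, str]) -> SourceAdapter:
    """Build an in-memory stand-in for a real dictionary API."""
    def lookup(term: str) -> List[DictionaryEntry]:
        definition = data.get(term.lower())
        if definition is None:
            return []
        return [DictionaryEntry(term, definition, name, f"{name} (stub entry)")]
    return lookup


common = make_stub_source("common", {})
sources = {
    "source_a": make_stub_source("source_a", {"mandate": "an authoritative command"}),
    "source_b": make_stub_source("source_b", {"mandate": "an official order"}),
}
for hit in federated_search("mandate", common, sources):
    print(f"{hit.source}: {hit.definition}")
```

The `always_include_third_party` flag mirrors the requirement that third-party results can be shown either alongside the common dictionary or only when no common term is found.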

Rainy Day Scenarios

As a researcher, I see there is an existing term definition in the Federated Dictionary, but that reference and definition do not accurately match the definition for my research. I need to use a different source and want that source referenced in the Federated Dictionary.

As a researcher, I tagged a document a while back with reference to a term-definition pair where that definition came from one of the third-party dictionaries, but now that third-party dictionary has updated the definition, which no longer fits the cited source document.

Requirements

Requirement

User Story

Importance

Jira Issue

Comments

Dictionary Search

 

 

 

 

Ability for researchers to search for terms using a word, phrase, ID, acronym, or definition

As a researcher I want to search for existing Dictionary Terms so that I can help understand the documents I am reviewing

p1

 

 

Ability to perform an advanced search

 

p2

 

 

Ability for the dictionary to suggest terms with autocomplete as the researcher enters a term

As a researcher I want the dictionary search to propose dictionary terms as I am typing to expedite the search process

p1

 

 

Ability for researchers to view details of an existing term’s dictionary-related details

As a researcher I want to view the details of an existing dictionary term found in the common dictionary so that I can further my research

p1

 

Must include details like one sees today in the Compliance Dictionary including acronyms, preferred term, non-standard terms, definitions (type, definition and source citation), other forms (plural, plural possessive, possessive), and relationships

Ability for researchers to view details of third-party dictionary results in addition to existing dictionary related results

As a researcher I want to view the details of third-party dictionary terms so that I can quickly perform my research without having to jump from dictionary to dictionary

p1

 

There should be an option for the researcher to display third-party results in addition to the common dictionary or display those results when no common dictionary term is found

Ability for researchers to view details of an existing term’s mapping-related details

As a researcher I want to view the details of an existing dictionary term found in the common dictionary and how it relates to existing documents so that I can further my research

p1

 

Must include details like one sees today in the Compliance Dictionary including related common controls (tagged and not-tagged terms), related citations (tagged and not-tagged terms)

Dictionary Term Request

 

 

 

 

Ability for researchers to submit a Term Request for a new term based on an integrated cited third-party dictionary source

As a content contributor I want to submit a request for a new Term with reference to a third-party dictionary that I want added to the dictionary so that my team can reference that term

p1

 

Who should be able to add terms from other dictionaries?

Will need some governance

Ability for researchers to submit a Term Request for a new term without citing an integrated dictionary source

As a content contributor I want to submit a request for a new Term without reference to a third-party dictionary that I want added to the dictionary so that my team can reference that term

p1

 

These could be multi-part words not defined in any of the dictionaries or other newer terms not yet defined

Will need some governance

Ability for researchers to search for existing Term Requests

As a content contributor I want to search for existing Term Requests so that I am aware of its status and can take further actions

p1

 

 

Ability for researchers to view, edit, delete, and withdraw existing Term Requests

As a content contributor I want to view, edit, delete, and withdraw existing requests so that I can manage the queue of work I want done

p1

 

 

Ability for reviewer(s) to validate and approve new terms, updates to existing term, and newly cited terms

As a reviewer I want to review and approve the Term Request details to ensure it has all the pertinent information

p1

 

Similar to what is defined in the Term Request process, need ability to set 0 to N number of reviewers in the business rules

 

Ability to check for duplicates

As a content contributor I want the system to automatically check for duplicates to ensure the Dictionary isn’t populated with overlapping redundant data

p1

 

 

Ability to submit a Term Request for a new term, update existing, or newly cited term as private

As a content contributor I want the Term to be visible only to users in my organization so that no other organizations are able to access our private and/or proprietary content

p2

 

 

Ability to submit an updated term definition and keep the version and history

As a content contributor I want to submit a subsequent version of the Term to go through the Term Request process so that any changes can be tracked

p2

 

Need to discuss this one since if terms have relationships with Citations or Common Controls, changing the definition could weaken the reference

Review Workflow

 

 

 

 

Ability for account administrators to set up business rules for validation of a term

As an account administrator I want to manage the submission and review workflow to best match the structure of my team so that we can submit terms as efficiently as possible

p1

 

Some contributors might not want any review since the person submitting the term is the domain expert. Other accounts may want one level of approval

Ability to set up task owners as a person, role or group

As an account administrator I want to flexibly manage the assignment of tasks in the term definition workflow process to control the queue of work so that I can best manage the workload of my team

p1

 

Above requirement was for the number of steps. This one is for who is assigned to those steps.

Indirect assignments like roles and groups make maintenance much easier and speed up onboarding / offboarding efforts.

Large clients need to have many people
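As a rough illustration of why indirect assignment eases maintenance, the sketch below resolves a task owner to concrete people at the time the task is opened. The `Assignment` type and the membership dictionaries are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Assignment:
    """Owner of one workflow step (hypothetical structure)."""
    kind: str  # "person", "role", or "group"
    name: str


def resolve_assignees(
    assignment: Assignment,
    role_members: Dict[str, Set[str]],
    group_members: Dict[str, Set[str]],
) -> Set[str]:
    """Return the people who may act on a step for the given assignment."""
    if assignment.kind == "person":
        return {assignment.name}
    if assignment.kind == "role":
        return set(role_members.get(assignment.name, set()))
    if assignment.kind == "group":
        return set(group_members.get(assignment.name, set()))
    raise ValueError(f"unknown assignment kind: {assignment.kind}")


roles = {"approver": {"dana", "lee"}}
groups = {"mapping-team": {"dana", "sam", "kim"}}
print(sorted(resolve_assignees(Assignment("role", "approver"), roles, groups)))
```

Onboarding or offboarding a person then means editing one role or group membership, not every workflow step that names them.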

Ability to submit unvalidated Term Requests

As a content contributor I want to contribute terms that are automatically approved with no need for validation so that my terms can be quickly used for tagging

p1

 

These terms must be marked as “unvalidated” to help end-users know that those terms should be used with caution

Contribution Management and API Protection

 

 

 

Most/all of the Contribution Management requirements are intended to be used as a guideline for all other objects that we open up to contributors including authority documents and dictionary terms. These rules must be enforced within APIs to protect the content quality

Ability for UC to revoke or allow organizations to contribute dictionary terms

As a UC Administrator I want to allow or revoke the ability of an organization to contribute so that UC can control who can contribute

The API rules engine must check this setting to allow contribution

 

 

The initial gating factor for contribution.

Ability for UC to revoke or allow contributors to validate their own dictionary terms

As a UC Administrator I want to allow or revoke the ability of a contributor to verify their own Terms without UC intervention so that UC can protect the UCF content quality while allowing others to contribute

 

The API rules engine must check this setting to allow validation of own content

 

 

Validation permissions can be set at an organization, group, role or person level. Initially the organization level will suffice.

If not allowed, other designated Organizations (starting with UC) must perform the validation. Later releases can then allow other Organizations to be designated as validators by UC.

Changes must go into effect immediately and retroactively to ensure that any content that is currently in process will follow the updated rule.

How can we detect collusion?

Ability for UC to designate an organization as a certified validator

As a UC Administrator I want to designate whether an organization is a certified validating entity so that we can control validation and inform users of the validation certification

 

The API rules engine must check this setting to determine whether organization is certified or not

 

 

This will help to curtail collusion by only allowing certified organizations to validate content created by others.

As a subsequent step we could limit the areas of validation to geography, subject matter, and/or industry.

Ability for UC to set validation indicator for contributors who are allowed to validate their own Terms

As a UC Administrator I want to inform users that content was contributed without formal validation by a certified validator so that users can make an informed choice to use unvetted content or not

 

The API rules engine must set the certification indicator (or not) based upon the organization’s certification status

 

 

If we allow contributors to validate their own content, UC can decide whether or not to designate that the contributed content has been validated by a certified validator.

This will allow contributors to validate their own content and allow UC one additional instrument to inform others.

UC can require a certification process which must be periodically completed to allow for certification.

Ability for UC to set and enforce minimal number of validation workflow steps

As a UC Administrator I want to determine the minimal number of validation steps so that UC can increase the likelihood of quality content

The API rules engine must check this setting to determine and enforce number of validation steps

 

 

When set to 0, an API call must still be made to designate that the Term has been validated, and it can be made using the same person / auth key. Otherwise, the Term Request stays in an unvalidated state.

When set to 1 or more, an approval workflow must be in place otherwise content cannot be added. The approval workflow will designate the group, role, or person. Each person / auth key must be different for each step in the approval flow else the validation cannot be performed.

If the organization is not allowed to perform their own validation, a final validation step will be required as mentioned in the prior requirement
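The step-count rules above could be enforced along the lines of the sketch below. It is a minimal illustration under stated assumptions: `TermRequest`, `approve`, and `ValidationError` are hypothetical names, and the sketch assumes that when one or more steps are required the contributor's own key also counts as "already used". The additional final validation step for organizations that may not self-validate is not modelled.

```python
from dataclasses import dataclass, field
from typing import List


class ValidationError(Exception):
    """Raised when an approval call breaks the workflow rules."""


@dataclass
class TermRequest:
    contributor_key: str
    approvals: List[str] = field(default_factory=list)
    validated: bool = False


def approve(request: TermRequest, approver_key: str, min_steps: int) -> TermRequest:
    """Record one approval and mark the request validated once enough exist."""
    if request.validated:
        raise ValidationError("Term Request is already validated")
    if min_steps == 0:
        # An explicit call is still required, but the same key may make it.
        request.approvals.append(approver_key)
        request.validated = True
        return request
    # Assumption: with 1+ steps, every step needs a key not used before,
    # including the key that submitted the request.
    if approver_key == request.contributor_key or approver_key in request.approvals:
        raise ValidationError("each step needs a different person / auth key")
    request.approvals.append(approver_key)
    request.validated = len(request.approvals) >= min_steps
    return request


self_validated = approve(TermRequest("alice"), "alice", min_steps=0)
print(self_validated.validated)

two_step = TermRequest("alice")
approve(two_step, "bob", min_steps=2)
print(two_step.validated)
approve(two_step, "carol", min_steps=2)
print(two_step.validated)
```

Until `approve` is called at least once, the request stays unvalidated, which matches the behaviour described for a setting of 0.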

Term Requests can only be updated or deleted by the contributor owner (person or organization) or UC

The API Rules Engine must ensure contributors only update their own content so that content updates can be controlled

 

 

Ensures contributors only update their own content

Term Request owners can be changed by UC

As a UC Administrator I want to change the owner of a Term Request so that it can be further processed

 

 

If a person or organization no longer exists or is determined to be a “bad actor”, there needs to be a way to change the owner to someone else so that subsequent changes can be made

Term Requests cannot be deleted once in a validated state

The API Rules Engine must ensure contributors do not delete Term Requests after they have been validated to ensure content deletions are controlled

 

 

Requests can be made to UC to delete content on the contributor's behalf

Term Requests cannot be updated once in a validated state

The API Rules Engine must ensure contributors do not update Term Requests after they have been validated to ensure content changes are controlled

 

 

A new version must be created with reference to the prior version
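The ownership and validated-state rules in this part of the table can be summarised in a short sketch. Everything here is hypothetical (`TermRequest`, `RuleViolation`, the `UC` constant, the function names). It assumes that UC may delete validated content on a contributor's behalf and that a new version starts out unvalidated.

```python
from dataclasses import dataclass
from typing import Optional

UC = "UC"  # assumed identifier for the Unified Compliance administrator


class RuleViolation(Exception):
    """Raised when an API call would break a contribution rule."""


@dataclass
class TermRequest:
    request_id: str
    owner: str
    definition: str
    validated: bool = False
    version: int = 1
    previous_version_id: Optional[str] = None


def _require_owner_or_uc(request: TermRequest, actor: str) -> None:
    if actor not in (request.owner, UC):
        raise RuleViolation("only the owner or UC may change this Term Request")


def update(request: TermRequest, actor: str, new_definition: str) -> None:
    """Edit in place; only allowed before validation."""
    _require_owner_or_uc(request, actor)
    if request.validated:
        raise RuleViolation("validated Term Requests need a new version")
    request.definition = new_definition


def delete_allowed(request: TermRequest, actor: str) -> bool:
    """Owners may delete unvalidated requests; UC may delete any (assumption)."""
    if actor == UC:
        return True
    return actor == request.owner and not request.validated


def new_version(request: TermRequest, actor: str, new_definition: str,
                new_id: str) -> TermRequest:
    """Create a follow-on version that points back to the validated one."""
    _require_owner_or_uc(request, actor)
    if not request.validated:
        raise RuleViolation("new versions require a validated Term Request")
    return TermRequest(new_id, request.owner, new_definition,
                       version=request.version + 1,
                       previous_version_id=request.request_id)


req = TermRequest("tr-1", "org-a", "first wording")
update(req, "org-a", "second wording")
req.validated = True
v2 = new_version(req, "org-a", "third wording", "tr-2")
print(v2.version, v2.previous_version_id)
```

Keeping the prior version untouched preserves whatever Citations or Common Controls already reference it.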

A new version of a Term Request can be created only when the Term Request is in a validated state

The API Rules Engine must allow contributors to create a new version of a Term Request only when it is in validated state so that content isn’t

 

 

 

Ability for UC to subsequently invalidate contributed Term Requests

As a UC Administrator I want to change the validation status to invalid for existing validated Term Requests so that I can ensure the quality of the content

 

 

UC may determine that some Term Requests should not have been validated (poor quality, duplicates …) and can subsequently invalidate them.

Limit number of Term Request API calls per type per period

As the API Rules Engine I want to ensure that Term Requests are not manipulated so quickly that they cause performance issues, and to block contributors from creating chaos
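A per-type, per-period limit could take the form of a sliding-window counter, sketched below. This is one possible technique and not a chosen design: `CallRateLimiter`, its parameters, and the use of the auth key as the bucket are assumptions for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class CallRateLimiter:
    """Allow at most max_calls per period for each (auth key, call type)."""

    def __init__(self, max_calls: int, period_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.period = period_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._calls: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def allow(self, auth_key: str, call_type: str) -> bool:
        """Record the call and return True if it is within the limit."""
        now = self.clock()
        window = self._calls[(auth_key, call_type)]
        # Drop timestamps that have aged out of the current period.
        while window and now - window[0] >= self.period:
            window.popleft()
        if len(window) >= self.max_calls:
            return False
        window.append(now)
        return True


limiter = CallRateLimiter(max_calls=2, period_seconds=60.0)
print(limiter.allow("key-1", "create"))
print(limiter.allow("key-1", "create"))
print(limiter.allow("key-1", "create"))  # third call inside the period
```

Rejected calls are not recorded, so a contributor who keeps retrying does not extend their own lockout.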

 

 

 

Open Questions

List any open questions that come to mind throughout the lifecycle of this project

Question

Answer

Date Answered

What is needed to be saved in the global/common dictionary and why?

 

 

Assuming we save definitions in the global/common dictionary with reference to the sources, how do we handle changes from the source dictionaries?

 

 

What is required to be saved locally for the NLP processing?

 

 

What is required to be saved locally for manual tagging?

 

 

What other services / functionality require terms to be persisted locally and what do they need?

 

 

Do we need a new database and schema for the Federated Dictionary, or can we use the existing Compliance Dictionary, and why?

 

 

Out of Scope / Future Functionality

History and versions

Impacted Product Components

The automatic NLU tagging process

Reference to Citations and Common Controls

User Interaction and Design

Link to mockups, prototypes, or screenshots related to the requirements.

Process Flow Diagrams

Links to user journeys, process flow, or other diagrams related to the requirements.

Guides

If there are UI components to this requirement, list the main areas where interactive user guides would be needed.

Additional References

Proposed Architecture for Federated Dictionary

End User Mapping Charter

Dictionary is Dead

Article validation - Meta (wikimedia.org)

Outdated Wikipedia Proposed Approval Mechanism

 

Competitors/Partners

Name

domain

alexaUsRank

alexaGlobalRank

trafficRank

Employees

employeesRange

marketCap

annualRevenue

estimatedAnnualRevenue

Merriam-Webster

merriam-webster.com

296

535

very_high

90

51-250

null

null

$10M-$50M

 

wikipedia.org

 

 

 

 

 

 

 

 

 

wiktionary.org

 

 

 

 

 

 

 

 

 

oed.com

languages.oup.com (used by Google)

 

 

 

 

 

 

 

 

 

dictionary.com

 

 

 

 

 

 

 

 

 

dictionary.cambridge.org

 

 

 

 

 

 

 

 

 

wordnik.com

 

 

 

 

 

 

 

 

 

britannica.com

 

 

 

 

 

 

 

 

 

collinsdictionary.com

 

 

 

 

 

 

 

 

 

macmillandictionary.com

 

 

 

 

 

 

 

 

 

Pearsons.com

 

 

 

 

 

 

 

 

 

thefreedictionary.com

 

 

 

 

 

 

 

 

 

vocabulary.com

 

 

 

 

 

 

 

 

"https://company-stream.clearbit.com/v2/companies/find?domain=merriam-webster.com" { "id": "43c2daf2-402e-4e30-9bbc-e765f6bd2ba9", "name": "Merriam-Webster", "legalName": null, "domain": "merriam-webster.com", "domainAliases": [ "merriam.com", "word.com", "becomingbankable.com", "m-w.com", "myspellit.com", "webster.com", "wordcentral.com", "merriam-webster.biz", "merriam-webster.info", "meriamwebster.com", "wordfind.net", "merriamwebster.com", "cdn-mw.com", "learnerdictionary.com", "marianwebster.com", "webster-mobile.com", "websterunabridged.com", "webstersthird.com", "merriam-websterunabridged.com", "merriam-webstersunabridged.com", "merriam-websterunabridged.net", "merriam-webster.net", "merriamwebster.net", "meriam-webster.com", "m-w.org", "mirriamwebster.com", "merriamwebster.org", "m-w.info", "merriam-webster.org", "miriamwebster.com", "spellcheck.com", "learnersdictionary.biz", "websters.info", "learnersdictionary.info", "m-wu.com", "mobile-webster.com", "merriam-websterunabridged.biz", "merriam-websterunabridged.org", "webstersunabridged.com", "merriam-websterunabridged.info", "merriamwebsterunabridged.com", "merriamwebstersunabridged.com", "unabridgedpreview.com", "learnersdictionary.org", "learnersdictionary.net" ], "site": { "phoneNumbers": [ "+1 413-734-3134", "+1 413-731-5979" ], "emailAddresses": [ "customerservice@merriam-webster.com", "privacy@m-w.com", "dpo@m-w.com", "GDPR_EURep@m-w.com", "permissioneditor@merriam-webster.com" ] }, "category": { "sector": "Information Technology", "industryGroup": "Software & Services", "industry": "Internet Software & Services", "subIndustry": "Internet", "sicCode": "27", "naicsCode": "32" }, "tags": [ "E-commerce", "Internet", "Technology", "Publishing", "B2C" ], "description": "Merriam-Webster, Inc. 
is an American company that publishes reference books and is especially known for its dictionaries.", "foundedYear": 1831, "location": "PO Box 281, Springfield, MA 01102-0281, US", "timeZone": "America/New_York", "utcOffset": -5, "geo": { "streetNumber": "281", "streetName": "PO Box", "subPremise": null, "streetAddress": "281 PO Box", "city": "Springfield", "postalCode": "01102", "state": "Massachusetts", "stateCode": "MA", "country": "United States", "countryCode": "US", "lat": 42.17073, "lng": -72.60484 }, "logo": "https://logo.clearbit.com/merriam-webster.com", "facebook": { "handle": "merriamwebster", "likes": 357987 }, "linkedin": { "handle": "company/merriam-webster-inc-" }, "twitter": { "handle": "MerriamWebster", "id": "97040343", "bio": "Word of the Day, facts and observations on language, lookup trends, and wordplay from the editors at Merriam-Webster Dictionary.", "followers": 999953, "following": 689, "location": "Springfield, MA", "site": "https://t.co/ezW3fH0kGo", "avatar": "https://pbs.twimg.com/profile_images/677210982616195072/DWj4oUuT_normal.png" }, "crunchbase": { "handle": "organization/merriam-webster" }, "emailProvider": false, "type": "private", "ticker": null, "identifiers": { "usEIN": null }, "phone": null, "metrics": { "alexaUsRank": 296, "alexaGlobalRank": 535, "trafficRank": "very_high", "employees": 90, "employeesRange": "51-250", "marketCap": null, "raised": null, "annualRevenue": null, "estimatedAnnualRevenue": "$10M-$50M", "fiscalYearEnd": null }, "indexedAt": "2022-12-01T07:22:35.526Z", "tech": [ "google_apps", "aws_route_53", "sendgrid", "nginx", "google_tag_manager", "jw_player", "google_analytics", "app_nexus", "media.net", "marchex", "appnexus", "apache_http_server", "dropbox", "turn", "entrust", "dstillery", "openx", "mediamath", "basecamp", "the_trade_desk", "mongodb", "microsoft_project", "ibm_cognos", "pubmatic", "datadog", "rubicon_project", "oracle_peoplesoft", "sugarcrm", "google_search_appliance", "cj_affiliate", 
"bluekai", "acxiom", "netsuite", "stackadapt", "postgresql", "mysql", "applepay", "admeld", "appier", "salesforce_dmp", "aggregate_knowledge", "atlassian_jira", "oracle_hyperion", "iponweb_bidswitch", "zedo" ], "techCategories": [ "productivity", "dns", "email_delivery_service", "web_servers", "tag_management", "image_video_services", "analytics", "advertising", "marketing_automation", "data_management", "security", "adverstising", "database", "monitoring", "business_management", "crm", "payment", "project_management_software" ], "parent": { "domain": null }, "ultimateParent": { "domain": null } }