Proposal for a Universal Database

Posted in About Technology at 1:37 am by Administrator


© 2007 Frederick P. Gault, Jr.

Science Fiction?

We’ve all seen a science fiction movie where the protagonist queries a computer system in natural language and receives exactly the information he needs. With improvements in digital technology, such a database of all human knowledge could be created. But the creation of this knowledge base is not a matter of equipment; rather, it is a grand exercise in library science. The challenge is not to create a database containing all human knowledge, but to create an organized system of information that can be queried to produce the desired result.

So, can a database be created that provides all human knowledge in an organized fashion? The short answer must be ‘No’, given that the human sphere of perception is in constant flux, creating new information at a rate that makes a strictly up-to-date knowledge base impossible. But what of a database that is as up to date as a modern, well-maintained library? As we know, a modern library suffers from a severe limitation: the lack of linkage between similar kinds of information. In short, a paper library doesn’t have digital hyperlinks. To be sure, a modern library is well organized, and there are manual mechanisms to assist the user in finding related material. But our science fiction protagonist can’t simply ask the library a question. Even asking a human librarian a question doesn’t come close to our goal of a complete, relevant, topical, quick, correct, concise and organized response to a query.

Knowing the User

William James once wrote: “[knowledge] supposes two elements, mind knowing and thing known”. What we should like to have is a human librarian who knows everything, and more importantly knows us, and can put us together with the information we crave instantaneously. This is the crucial part of any Universal Database. Just as there must be data, and meta-data, there must also be a relationship between the knower and the known.

The Internet?

In some respects the Internet is a sort of incomplete repository of global knowledge, and search-engine technology is one attempt to bring order to the chaos of this modern information miracle. But as a Dewey Decimal System, Google is inadequate, and perhaps always will be, unless there is a concerted effort to digitize and, more importantly, organize human information.

The Internet is an organic public body of knowledge without any universally agreed-upon meta-data standards. There are, of course, industry consortia that have established XML Schemas for specific needs. But these Schemas are a feeble collection of subsets of what is required for a Universal Knowledge Base.

For example, many texts are not digitized. And even if they were, is that digital text really optimally accessible information? In order for digital information to be truly useful it needs a staggering amount of editorial work.

Benefits: Is it Worth It?

What are the benefits of such a knowledge system? Current efforts to organize and present data, such as Wikipedia, are extremely beneficial. However, they fail in one critical area: the presentation of information. The user can easily extract raw information from Wikipedia, but this ignores the way humans typically use knowledge. Here is an example:

Current database technology is very good at answering a query such as this:

[Who was Samuel Pepys]?

But it is not so good at answering a query such as this:

[What did Samuel Pepys think of King Charles II of England]?

The latter query requires a synthesis of factual information into a “fuzzier” answer. Now consider this permutation of the latter query:

[What did Samuel Pepys think of King Charles II of England in 1662]?

This query requires a synthesis of factual information along a time axis. In other words, Pepys’s opinion of King Charles II may have changed over time.

And consider another factor, the age and knowledge of the person requesting this information:

[Who was Samuel Pepys? I am 11 years old.]


[Who was Samuel Pepys? I am writing a Ph.D. Thesis on Restoration England.]

These two queries should provide very different results, for obvious reasons.
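The permutations above suggest that a query is more than a text string: it carries a subject, an aspect, a point on the time axis, and a profile of the asker. Here is a minimal sketch in Python of what such a structured query might look like; every name and field is hypothetical, not a committed design.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class UserProfile:
    """Who is asking -- the 'mind knowing' half of William James's pair."""
    age: Optional[int] = None       # e.g. 11
    expertise: str = "general"      # e.g. "PhD candidate, Restoration England"

@dataclass
class Query:
    """A structured query against the knowledge base."""
    subject: str                      # e.g. "Samuel Pepys"
    aspect: Optional[str] = None      # e.g. "opinion of King Charles II of England"
    as_of_year: Optional[int] = None  # point on the time axis, e.g. 1662
    profile: UserProfile = field(default_factory=UserProfile)

# The example queries from the text, expressed as structured objects:
q1 = Query("Samuel Pepys")
q2 = Query("Samuel Pepys", aspect="opinion of King Charles II of England")
q3 = Query("Samuel Pepys", aspect="opinion of King Charles II of England",
           as_of_year=1662)
q4 = Query("Samuel Pepys", profile=UserProfile(age=11))
```

The point of the sketch is that the time axis and the asker’s profile are first-class parts of the query, not afterthoughts bolted onto a keyword search.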

I submit that there is knowledge available to us now that waits only for someone to stumble across related bits of information. What might it be? Perhaps research showing that a certain drug will cure a disease, or something as startling as evidence of intelligent life in the cosmos that we simply hadn’t put together in the right way to recognize.

Organization of Human Knowledge with Meta Data

In order for any Universal Data Source to be accessible in a meaningful fashion, all information that enters the Database must be edited by humans according to generally accepted standards. This “Meta Data” will serve to organize, categorize and relate disparate facts.

Wikipedia is a mechanism which brilliantly addresses the challenge of editing vast amounts of information and presenting a correct result. Wiki is essentially a public collaboration that is designed to assemble large quantities of information while assuring that it is correct. See (http://en.wikipedia.org/wiki/Wiki).

A current manifestation of wiki is Wikipedia (see http://en.wikipedia.org/wiki/Main_Page), a web encyclopedia. Wikipedia shows the practical power of wiki public collaboration. But one can see that Wikipedia is struggling with the broader issues of information technology. For example, Wikipedia has listings in some 20 languages (as of this writing), a small subset of the languages spoken on the planet.

There are also challenges associated with individuals who maliciously present false or highly opinionated information as fact. The general community can move to ban these individuals and correct the information, but with an all-volunteer community, it is clear that not all information can be validated.

Factually Correct

This troubling issue is perhaps the most difficult. What does it mean to be “correct” or “factual”? I can offer one possible guide, that being the scientific method. However, even that reflects the prejudice of my world view. Certainly a “Born Again Christian” would not accept this mechanism. Contrast the opinion of an Islamic Scholar with that of a Buddhist Nun and one can see that “truth” is a deeply complex concept.


Keeping information up to date is another towering, labor-intensive task. This process alone would require an army of individuals to review and enter information and edit its meta-data.

Time-Based

The individual requesting information will need to know when humanity became aware of some particular information, and when humanity may have updated that knowledge.

When meta-data is provided for information, the editor(s) will need to mark some information as “unknown”, “incorrect” or “widely believed at the time” over a spectrum of dates in the past, up to the present.

Let me provide an example. Archaeologists discovered clay tablets on the island of Crete written in a script which has come to be known as “Linear B”. Decades passed as scholars attempted to decode this script. Finally it was determined that Linear B was in fact an early form of Greek, and it became possible to read the tablets.

So, let’s imagine scholars looking at information about Linear B. As of 1200 BCE, this script was not known as Linear B; it had an indigenous name and was well understood on the island of Crete, which wasn’t called Crete at the time. By 1000 AD, no one alive knew of Linear B, what it said, or where it was. In the 20th century there were periods when Linear B was known but not decoded, then known and readable.

A scholar might then ask, “Who could read Linear B in the year 1000 AD?” and the correct answer would be “Nobody”. But the answer to “Who could read Linear B in the year 1200 BCE?” would also be “Nobody”, because there was no such thing as “Linear B”, which is a 20th-century name. What the scholar really wants to know is that the inhabitants of the island now known as Crete wrote and read what we now call Linear B, an early form of Greek, in 1200 BCE.

The point is that knowledge evolves over time, and it must have meta-data associated with it that will allow each query to specify a particular point in history.
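As a sketch of such time-axis meta-data, each fact can carry an interval during which it was part of the state of human knowledge. The structure, not the history, is the point here; the Linear B dates are approximate and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Fact:
    statement: str
    known_from: int                    # year the knowledge was held (negative = BCE)
    known_until: Optional[int] = None  # year it was lost or superseded; None = still current

linear_b = [
    Fact("The script is in everyday use and readable by its scribes", -1450, -1100),
    Fact("Tablets exist in an undeciphered script, named 'Linear B'", 1900, 1952),
    Fact("Linear B is an early form of Greek and can be read", 1952),
]

def state_of_knowledge(facts, year):
    """What did humanity know about this topic in the given year?"""
    return [f.statement for f in facts
            if f.known_from <= year and (f.known_until is None or year < f.known_until)]
```

A query for the year 1000 AD correctly returns nothing: nobody could read the tablets, and the name “Linear B” did not yet exist.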

Social constraints:

1. Some cultures may disagree on what is “true”.
2. Some information is of such value that the individual asking for it must pay for it.
3. Some information is secret and available only to authorized people.
4. Some information may be censored (leaving the moral implications aside for now).

Facts, Opinion and Entertainment

Meta-data must cover the various strata of information. Some knowledge will be considered fact (within the limits of how we define “facts”, discussed above), while some information is simply entertainment. For example, Shakespeare’s Hamlet is not a “fact”; it is a play, a work of fiction, and several different versions of the play survive. However, there are a variety of “facts” associated with Hamlet, such as when it was written, and where and when it was performed.


“Nodes” of information will require a terse “abstract”. By this I mean that a query will often return oceans of information, much of which may not be relevant to the user’s search. So it will be prudent to provide a summary of the information available, so the user may decline the larger body of data they don’t want.

Anyone who has used an online search engine knows the hopelessness of receiving hundreds of thousands of responses to a search. Through a concerted effort of editors building relevant meta-data, and indeed corrections requested by users, these summaries can provide a more relevant query mechanism.

Recently a colleague used Wikipedia to look up what “Bayesian” meant. She had seen the word in conjunction with spam filtering. Unfortunately she first spelled it “Beysian”, which resulted in “No page with that title exists”. After several tries she got the spelling correct and read a lengthy, detailed entry, finding the reference to spam filtering near the middle. What she really wanted was a discussion, matched to her knowledge level, of how Bayesian probability can be used in spam filtering. She didn’t want to know HOW to write a Bayesian spam filter; she wanted to know WHAT a Bayesian spam filter IS.

My colleague was ultimately able to traverse the database and come up with the information she needed. But we can do so much better.
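One modest improvement is purely mechanical: instead of answering a misspelled query with “No page with that title exists”, the system could suggest near matches. A sketch using Python’s standard difflib; the title index here is a hypothetical stand-in for the database’s real index.

```python
import difflib

# Hypothetical stand-in for the database's index of entry titles.
titles = ["Bayesian probability", "Bayesian spam filtering", "Bessel function"]

def suggest(query, index=titles, n=3, cutoff=0.5):
    """Return entry titles that closely match a (possibly misspelled) query."""
    lowered = {t.lower(): t for t in index}          # case-insensitive lookup
    matches = difflib.get_close_matches(query.lower(), lowered, n=n, cutoff=cutoff)
    return [lowered[m] for m in matches]             # map back to display titles
```

With this in place, my colleague’s “Beysian spam filtering” would have led her straight to the right entry on the first try.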

Edited for “Depth”

The “synopsis” of an information Node is the shallowest level of “depth” that should be applied to information. A query may be made for only an overview of a topic.

Let us take another example: Abraham Lincoln. At one depth the user will be presented with a short paragraph and a photo, giving his date of birth, the fact that he was President of the United States, that the US Civil War was conducted during his administration, and that he was assassinated. At a deeper depth the user would be given details of his marriage and children, who assassinated him, and so forth. At a deeper depth still, a discussion of his administration, his handling of the Civil War, and the Emancipation Proclamation would be included.

These levels of completeness would increase until each scrap of information related to anyone ever named Abraham Lincoln would be available.
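The Lincoln example can be sketched as a node whose editors have prepared content at several depths; the reader asks for a depth and receives the entry prepared at or below it. All names and text here are illustrative, not a proposed schema.

```python
# A hypothetical information Node edited at three depths.
lincoln = {
    1: "Abraham Lincoln (1809-1865), 16th President of the United States; "
       "led the Union through the Civil War; assassinated in 1865.",
    2: "Adds his marriage and children, and his assassination by John Wilkes Booth.",
    3: "Adds his administration, the conduct of the Civil War, "
       "and the Emancipation Proclamation.",
}

def at_depth(node, depth):
    """Return the entry for the requested depth, capped at the deepest edited level."""
    return node[min(depth, max(node))]
```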

Edited for Relationship to Other Information

Today we are used to information “hyperlinks”. But as the Xanadu project has argued, the web hyperlink is a unidirectional, single linkage, not adequate for the complex relationships found in the real world.

Presentation of Information to Humans

For even the most well organized database, the user is an integral part of the system. A Universal Database interface should be able to learn about each individual using it. In fact, those individual preferences should be considered part of the knowledge base.

Since I assume that knowledge is defined as the RELATIONSHIP between facts and the individual who consumes those facts, the value of a Universal Database is directly proportional to the ability of the user interface to “understand” the user.

Of necessity we must extract a subset of parameters based on human behavior to use as a guide to the user’s personality:


Language

Once information is retrieved it must be presented in a language of the user’s choice.


Depth

The user will specify at which depth they wish to have their query answered.


Cultural Sensitivity

This is a tricky question that collides with the definition of “truthfulness”. Individuals of different cultures will certainly have different requirements for presentation.


Age

A child of 5 should have their query answered in a more basic way, with assumptions about the limits of the child’s understanding. The same query will be answered more completely for a child of 10.


Intelligence

An individual with an IQ of 75 will need a different presentation than a person with an IQ of 100, as will someone with an IQ of 120. This presentation will need to be adjustable in a sensitive way.

Education and Experience

Even an extremely intelligent person may need a “dumbed down” presentation if the topic is one they are unfamiliar with. An individual’s personal experience and education will set levels of presentation.

Expert Systems & Training

A query may not even be forthcoming. Indeed, an individual may want to interact with the Data Source as a learning experience, as if in a classroom. I would expect that institutions of learning will develop curricula that lead the student through the knowledge, while ensuring that the information is truthful and up to date.

Commercial Advertising

As an aid to product evaluation, users will want advertising included in the body of knowledge in the database.


Barriers

There are significant barriers to the creation of such a Universal Database. The most evident problems are standards and editorial effort. First and foremost, there must be global agreement upon the data schema, after which a massive and never-ending editorial effort will commence.


Obviously a world body will be needed to define the schema necessary to support the different data and their associated meta-data. This committee will concern itself only with the mechanics that allow for “cultural sensitivity” and “truthfulness”, without attempting to define them. This body will draft a sort of constitution for the governance of the Data Source and its query mechanics.

Attempts have been made to do exactly this, most notably the Xanadu Project (ref).


This committee might be structured in a fashion similar to that of the “Immortal Forty” of the Académie française. They will safeguard the integrity of the Data Source. Their job will be primarily the management of the Judiciary, the hierarchy of Node-specific committees, and general guidance. To this end, individuals of stature and integrity will be best.

Due to the vast infrastructure and workforce required, the management committee will also be the financial managers of a large corporate enterprise.


A Judiciary should be established to resolve the inevitable conflicts in the interpretation of information. Various topic-specific levels should be provided, with access to panels of experts in the field. This “court system” will concern itself strictly with meta-data issues, leaving issues of ownership to the global and civilian court systems.

Digital Rights Management

Ownership issues will, of course, need to be addressed. In short, a “pay as you go” or “micro pay” system seems best. It should be left to the standards committee to decide how payment should work. However, it is safe to say that money will need to be collected for use, not only to compensate content owners, but to provide a “tax” to support the Data Source infrastructure.

That is not to say that parts of the Data Source will not be free. I would expect that large portions will be free, and that mostly entertainment properties will be pay as you go.

Legal Implications

A legal infrastructure will need to be developed to interact with the internal and external court systems. The Data Source will need to have limited liability from the possible misuse of knowledge.

Volume of Information

A robust infrastructure needs to be financed and maintained. Even assuming the Internet hosts the user interface, the Universal Database will require its own servers, backups, media distribution mechanisms and so on. The volume of information will be so vast, and grow so quickly, that continual effort will be required to test and improve dedicated facilities.

Editorial Agreement

The meta-data editorial effort will require continual collaboration with experts in the fields relevant to a given piece of data. The judicial mechanism should be reserved for only the most intractable problems. It should be understood that a Universal Knowledge Database should accommodate the differing opinions commonly found in various professional disciplines. This does not mean, however, that differing opinions should carry the same weight. For example, the theory of evolution is generally accepted despite the fact that some humans accept theological myths as factual explanations for the development of genetic phenotypes.


Parts of the database must be sealed off from general usage. For example, the knowledge necessary for the construction of dangerous weapons may need to be restricted or monitored for global security. Digital rights management is contingent upon convincing content owners that their property is secure and will be paid for as required. In addition, users must have confidence that their costs will be clearly outlined and correct.


Maintenance will perform several roles:
1. Performance (i.e., response time)
2. Correctness
3. Adherence to Database policy as set forth by the committee
4. Completeness
5. Corrections from the user community
6. Improvements to the meta-data
7. Finance, cost adjustment
8. Compliance with content rights

A Proposal

The Committee

A standards committee should be set up to accept global input on a generalized knowledge schema. Specifically, the meta-data required should be delineated. The ability to integrate existing industry schemas should be of particular interest. Rules for default automation of existing public-domain information should be established. For example, information currently available on web sites should simply be hyperlinked by default until such time as editors can integrate that information into the Universal Database with appropriate meta-data.

The Staffing

Once the standards committee has determined the schema, it should be tested on a subset of information for evaluation and improvement. This phase of the life of the database will help establish the editorial effort required.

Software tools should be built to integrate existing forms of well-organized information into the database.

Payment System

Funds will be needed to maintain the database and continue to add information and editorial improvement. Several mechanisms should be considered.
1. Participation by government entities, educational institutions and corporate entities that will reap benefit from the product.
2. A pay-as-you-go system for some access quanta yet to be determined.
3. A tax or subscription service for advanced levels of usage.
4. Charges for access to industry-specific domains of information.

International Standards

Query Language

A standardized query API should be established by the committee. Its output can then be operated upon by the weeding and training modules. This is analogous to the way SQL is used to query a relational database.
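As a sketch of how the pieces might fit together, the standardized query layer returns raw nodes, and the weeding module then tailors them to the user’s requested depth. Every name, record, and data value here is hypothetical, included only to show the division of labor.

```python
# Hypothetical data store: subject -> list of nodes at various depths.
store = {
    "Samuel Pepys": [
        {"depth": 1, "text": "English diarist and naval administrator, 1633-1703."},
        {"depth": 3, "text": "Detailed analysis of the diary's shorthand system."},
    ],
}

def query_api(subject, data=store):
    """Standardized query layer: return every node matching the subject."""
    return list(data.get(subject, []))

def weeding_module(results, max_depth):
    """Tailor raw output to the user's profile by dropping overly deep material."""
    return [r for r in results if r["depth"] <= max_depth]

# A casual reader at depth 1 sees only the synopsis node.
tailored = weeding_module(query_api("Samuel Pepys"), max_depth=1)
```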

Weeding Module

Software that takes a user’s profile and tailors the query output to his or her level.

Inference Module

Raw factual data must be integrated based on the query presented to the database and the personality profile of the requestor.

Training Module

Software that presents courses derived from the database. Allows teachers to use some sort of scripting to build didactic templates.

Time Relational Data Model

Each bit of data in the database will have the following time-relational meta-data:
1. When the information entered the database
2. When the information was known (a date before entry)
3. When the information was updated
4. When the information update was known
5. When the information was invalidated
6. When the information was accessed
7. When information security denied access
8. When the information was flagged for possible correction
9. When the information was corrected

In this way it will be possible to “roll-back” the state of the data based on time, updates, corrections and so forth. It will also be possible for researchers to know when particular knowledge was known.
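A minimal sketch of such a roll-back, using only the “entered” and “invalidated” fields from the list above. The Pluto history is real (reclassified by the IAU in 2006); the structure and names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Revision:
    entered: int                       # year the statement entered the database
    statement: str
    invalidated: Optional[int] = None  # year it was superseded, if ever

def as_of(history, year):
    """Roll back the datum: return the statement current in the database in `year`."""
    current = None
    for rev in sorted(history, key=lambda r: r.entered):
        if rev.entered <= year and (rev.invalidated is None or year < rev.invalidated):
            current = rev.statement
    return current

# Hypothetical revision history of one datum:
pluto = [
    Revision(1930, "Pluto is the ninth planet.", invalidated=2006),
    Revision(2006, "Pluto is a dwarf planet."),
]
```

A query rolled back to 1990 returns the ninth-planet statement; rolled back to 1900, it returns nothing, because the datum had not yet entered the database.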
