The Semantic Web

A major goal of the Semantic Web is to make data open, understandable, and easy for programs to consume. This is in contrast to the state of the practice in which data lives in closed silos, accessed with special-purpose protocols. And today, even if you obtain data from a silo, it still isn’t clear what the data means.

The most immediate of the many advantages of Semantic Web technologies is improved data integration. XML recognizes the problem and is one step toward a solution. XML tags label bits of data, but their meaning remains implicit in the mind of the tag’s designer or in special-purpose code that processes the data. Semantic Web tags come with machine-understandable definitions in the form of ontologies. The data is stored in a uniform format, so your program doesn’t need to know special schemas or proprietary protocols to access it. Data integration is eased because the ontologies define type information, acceptable values, and how each concept relates to others. Your applications gain access to new sources of information that are discoverable and easy to consume. Perhaps more important, you do not have to decide ahead of time which integrations to support and set that choice in stone. You can publish information using Semantic Web technologies without knowing who might use the data in the future, yet knowing that they will be able to use it without heroic effort.

Semantic Web standards define how a Web page can contain assertions of fact (using RDF), which can augment the visible page or form its entire content. A program can visit such a page, recognize these facts, and load them into a data store. The RDF markup on a page gains meaning by conforming to some ontology. That ontology can be a very simple vocabulary, such as vCard for contact information or Dublin Core for describing and referencing documents; a more extensive ontology, such as GoodRelations for e-commerce catalog data; or an arbitrarily complex ontology defined using a language called OWL.

Once a program has obtained and stored some RDF data, the program can query it (with a language called SPARQL). It can use the ontology associated with any element to validate that data, use relationships the ontology defines between concepts, and infer new facts. The program can treat the Semantic Web like a kind of database that provides the data for its computations.
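The load, query, and infer steps described above can be sketched in a few lines of plain Python. This is a toy illustration, not RDF or SPARQL itself: facts are held as (subject, predicate, object) tuples, `match` does simple pattern matching with wildcards, and `infer_types` applies a single rule (an instance of a class is also an instance of its superclasses). The `ex:` names, the price, and the helper functions are all hypothetical and exist only for this example; a real application would more likely use an RDF library and a SPARQL engine.

```python
from typing import Iterable, Iterator, Optional

Triple = tuple[str, str, str]

# Hypothetical facts, as a program might have extracted them from a page.
FACTS: set[Triple] = {
    ("ex:Widget42", "rdf:type", "ex:PowerTool"),
    ("ex:Widget42", "ex:price", "19.99"),
    # Hypothetical ontology statements relating the concepts to each other.
    ("ex:PowerTool", "rdfs:subClassOf", "ex:Tool"),
    ("ex:Tool", "rdfs:subClassOf", "ex:Product"),
}


def match(store: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in store:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


def infer_types(store: Iterable[Triple]) -> set[Triple]:
    """Return the store plus the rdf:type facts implied by rdfs:subClassOf."""
    inferred = set(store)
    changed = True
    while changed:  # repeat until a pass adds nothing new
        changed = False
        for inst, _, cls in list(match(inferred, p="rdf:type")):
            for _, _, parent in list(match(inferred, s=cls, p="rdfs:subClassOf")):
                new_fact = (inst, "rdf:type", parent)
                if new_fact not in inferred:
                    inferred.add(new_fact)
                    changed = True
    return inferred


def types_of(store: Iterable[Triple], subject: str) -> list[str]:
    """List the classes a subject is asserted (or inferred) to belong to."""
    return sorted(o for _, _, o in match(store, s=subject, p="rdf:type"))
```

Calling `types_of(FACTS, "ex:Widget42")` returns only the asserted class, while `types_of(infer_types(FACTS), "ex:Widget42")` also includes the superclasses that the ontology statements imply.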

Consider one example of how this technology leads to business value. There is a growing movement known as the Linked Data cloud, which consists of Web pages that publish data using RDF and, where sensible, refer to other Linked Data pages that define concepts or provide relevant data. This data is all published in a uniform way, in RDF (commonly serialized as XML), so the data integration problem is greatly reduced. For example, a program could read the RDF version of an article in the NYTimes node in the cloud. When a fact in the article references a location, the program can follow the link to the GeoNames Linked Data node to obtain further information about that location, such as its local name, its latitude and longitude, or the country in which it lies. Other facts might be elaborated by following links into DBpedia, which is the Semantic Web translation of Wikipedia. No one organization has to maintain all this data, and nobody has to plan ahead of time how the data will be read or used. The program will not understand the news article the way a human does, but by accessing many expert sources, it can obtain many salient facts and relations from the story.
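The link-following behavior in this example can be sketched as a small breadth-first traversal. The sketch makes no network requests: `SOURCES` is a hypothetical in-memory stand-in for three Linked Data nodes, `dereference` stands in for an HTTP fetch, and every URI, property name, and value is invented for illustration rather than taken from the actual NYTimes, GeoNames, or DBpedia datasets.

```python
from collections import deque

# Hypothetical stand-ins for published Linked Data; all content is invented.
SOURCES = {
    "news:article/1": [
        ("news:article/1", "dc:title", "Flooding closes harbor"),
        ("news:article/1", "news:location", "geo:place/100"),
        ("news:article/1", "news:mentions", "wiki:Harbor"),
    ],
    "geo:place/100": [
        ("geo:place/100", "geo:name", "Exampleville"),
        ("geo:place/100", "geo:lat", "12.5"),
        ("geo:place/100", "geo:country", "geo:country/XX"),
    ],
    "wiki:Harbor": [
        ("wiki:Harbor", "rdfs:label", "Harbor"),
    ],
}


def dereference(uri):
    """Stand-in for fetching a URI: return the triples published there, or []."""
    return SOURCES.get(uri, [])


def crawl(start, max_hops=2):
    """Collect triples from start, following object URIs that resolve to data."""
    seen = {start}
    frontier = deque([(start, 0)])
    collected = []
    while frontier:
        uri, depth = frontier.popleft()
        for triple in dereference(uri):
            collected.append(triple)
            obj = triple[2]
            # Follow the link only if it is new, within range, and resolvable.
            if depth < max_hops and obj not in seen and dereference(obj):
                seen.add(obj)
                frontier.append((obj, depth + 1))
    return collected
```

Starting from `crawl("news:article/1")`, the program gathers the article's own facts plus those published by the location and concept nodes it links to, without any of the three sources having coordinated in advance. The `max_hops` limit is a design choice for the sketch: it keeps a traversal from wandering arbitrarily far through the cloud.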

Explicit Knowledge can help with the following Semantic Web projects and more: