Social Networking, Semantic Searching and Science
Executive summary: The tools and tricks of scientific collaboration are still pretty old school. With the ivory tower not being a major profit center, how can innovations in the private sector (which far outstrip academia’s capabilities) be brought over to accelerate scientific research and discovery? (Caveat: I have no answers, just a problem statement!)
One “feature request for the online universe” that I still carry from my previous science career can be loosely characterized by the following questions:
- “I’m researching topic X. What are the seminal papers in this area? Who are the primary researchers whose work I should read?”
- “Ten years ago I published a paper on topic Y, then got distracted by grant funding in another area. I’d like to understand the full ‘intellectual lineage’ of this paper, now that others have had a chance to chew on it. What has it led to? Have questions been answered? Have new ones emerged?”
- “Who are the currently active researchers working on topics most closely related to my own research? Are there any bright new stars whose work I should keep an eye on?”
The third mockup is heavily inspired by the TouchGraph network mapping utility applied to Facebook, as shown below.
In TouchGraph/Facebook, individuals manually attach themselves to regional or institutional networks, and the graph organizes people clusters to be proximate to their networks. Swap “manual regional network” for “automatically identified research theme” in the mockup above, but the same basic concept holds.
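The swap can be sketched in a few lines of Python. Everything below — the researcher names, the themes, and the topic weights — is invented for illustration; in practice the theme weights would come from semantic analysis of each researcher’s publications.

```python
# Sketch: the TouchGraph idea with manual regional networks swapped for
# automatically identified research themes. Each researcher clusters
# around whichever theme dominates their (invented) topic weights.
researchers = {
    "Researcher A": {"tropical cyclones": 0.8, "data assimilation": 0.2},
    "Researcher B": {"tropical cyclones": 0.6, "data assimilation": 0.4},
    "Researcher C": {"data assimilation": 0.9, "tropical cyclones": 0.1},
}

clusters = {}
for name, weights in researchers.items():
    theme = max(weights, key=weights.get)  # dominant auto-identified theme
    clusters.setdefault(theme, []).append(name)

print(clusters)
# {'tropical cyclones': ['Researcher A', 'Researcher B'],
#  'data assimilation': ['Researcher C']}
```

The graph layout would then place researchers near their cluster’s centroid, exactly as TouchGraph places people near their declared networks.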
Scientists (or, I suppose, any scholarly researchers) know that the traditional solutions to these questions are social and/or “manual”. Researchers typically get the answer to (1) in their formal graduate education (which may lead to “frozen in time” syndrome), by word of mouth, or by roll-up-the-sleeves, painstaking literature searches. These approaches work reasonably well, although they tend to marginalize those without access to a rich social network (graduate students in small departments, researchers with limited conference travel resources, those in developing countries, those looking to bridge disciplines, etc.). Conventional online social networks may offer some relief there, but as this commentary notes, social networking does not seem to be making rapid inroads into scientific communities, because the ROI isn’t readily apparent. (I’ll second this observation with anecdotal evidence of the puzzled, head-scratching virtual looks I get when I invite old colleagues to join LinkedIn. In fairness, LinkedIn has perhaps yet to demonstrate its value beyond contact management and general nosiness appeal.)
Questions (2) and (3) might be waved off with the “expectation of currency”: active researchers in a field are expected to regularly review all new publications and keep abreast of those related to their interests. This is a legitimate expectation, albeit with a few caveats. First, it tends to reinforce trends towards increased stovepiping and niche expertise, trends which I would suggest encourage the further “industrialization” of science and work against scientific innovation. Second, it has limited sustainability, since overall scientific research output continues to increase (data from 1991-1998 indicate 2-10% growth per year, depending on region, and it’s probably safe to say that trend has not decreased with the emergence of e-publication efficiencies after 1998).
“But wait, there’s Google Scholar…” Yes, indeed there is, and it’s a good start. But I maintain that as wonderful as Google is, its output still fits painfully inefficiently into actual research workflows. A checklist is still a checklist, and Google “finds” are still granular items stripped of their semantic context. I argue that for scholarly publications, that semantic context has two components: the history of (and followup from) individual papers, and the continuity in thinking and theory of specific authors. On a long time scale, scholarly research and publication is a knowledge process, and existing search tools cannot look forwards or backwards along the links in the knowledge chain. That currently happens only in the wetware of the individual researcher/scientist. I find this frustrating, since we are rapidly approaching the point when all of the necessary raw data will be accessible online, but the “back end” linkages, tools and standards to extract knowledge from that data are still lagging.
A pause for context, and a “prize dataset”
Why am I fired up about this topic? I’ll share a little history which might illustrate my frustration about the gap between what’s possible and what we have, as well as a “prize dataset” to possibly use going forward. From 1995 to 2005 I was fortunate enough to participate in the “Information Systems Committee” of the American Meteorological Society. The committee name is somewhat misleading: this group served as a strategic planning body for the long-term stewardship of AMS’ scholarly publications, and their migration from print to online journal archival and distribution. AMS staff are the best I’ve ever worked with, and the Society was incredibly forward looking in its approach to the migration. Some of the group’s finest accomplishments:
- Wrestled with online copyright issues far ahead of other groups, and helped define copyright policy standards later adopted by other professional societies.
- Successfully migrated the journal business model from print to online without traumatizing or cannibalizing other AMS business areas – again, far ahead of the curve.
- In this migration, converted the complete history of AMS publications to digital format, retaining its underlying content (i.e., not just “digital photocopies”). This is incredibly important, and a practice which many other societies did not follow in the mad late-90’s rush to convert legacy materials.
- Migrated content not directly to HTML, but to a core SGML format which included basic semantic metadata and tagging. If your eyes are glazing over at this point, here’s the impact: all papers are/can be rendered on the fly not only in user interface standards of today (PDF/HTML) but into whatever standards emerge in the future. (“It’s the content, stupid.”)
- Identified a controlled vocabulary for the discipline of atmospheric science (the AMS Glossary). This will be important later on when it comes to semantic analysis.
- Fully embraced the “persistent URL” DOI standard employed by CrossRef, ensuring survivability and accessibility of all AMS content, and allowing built-in reverse and forward indexing of citations (BINGO!).
- Achieved all of this with a business model that ensured historical content could be made freely available to the public, after a reasonable period from its initial publication.
Here’s an example of many of these features in action (an old paper of mine). Note the persistent URLs, reverse/forward citations, as well as the full content rendered “on-the-fly” from a parent dataset into multiple formats (HTML, PDF). This is all now Standard Operating Procedure at AMS.
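Those reverse and forward citation links are what make programmatic “lineage” queries possible. Here’s a minimal sketch of the idea: the DOIs and citation pairs below are made up for illustration, standing in for the metadata a CrossRef-style service provides.

```python
# Minimal sketch: forward/reverse citation indexing over DOIs.
# All DOIs and citation pairs below are invented for illustration.
from collections import defaultdict

# (citing_doi, cited_doi) pairs, as citation metadata might provide them
citations = [
    ("10.1175/paper-C", "10.1175/paper-A"),
    ("10.1175/paper-C", "10.1175/paper-B"),
    ("10.1175/paper-D", "10.1175/paper-C"),
]

cites = defaultdict(set)     # reverse look-back: what does X cite?
cited_by = defaultdict(set)  # forward look-ahead: who cites X?
for citing, cited in citations:
    cites[citing].add(cited)
    cited_by[cited].add(citing)

def descendants(doi):
    """The 'intellectual lineage' of a paper: everything downstream of it."""
    seen, stack = set(), [doi]
    while stack:
        for nxt in cited_by[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(descendants("10.1175/paper-A"))  # paper-C directly, paper-D transitively
```

With persistent DOIs for every paper, question (2) from the start of this article reduces to exactly this kind of graph traversal.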
Why the brag session? Because all these choices were made with an eye towards maximizing what could be done with the scientific knowledge that AMS stewards, not based on the capabilities of today, but on the capabilities we will probably have 10-20 years out, given advances in online protocols and standards, computing horsepower, lexical and semantic analysis, etc. AMS thus has a perfect “prize dataset” to experiment with some of those far-future capabilities today. Even if the rest of the world is bogged down in proprietary solutions, limited data, and jostling standards, the AMS repository could be used to prototype what “should” be possible five to ten years out.
A few key advances are needed to truly achieve knowledge-based processing of the wealth of online scholarly publications now (and in future) available:
Gaps: Universal Connectivity
Complete embrace of the CrossRef DOI standard (or something very similar to it) is a must. CrossRef is the social network “core asset” that binds scholarly publications together, through time. While its basic unit is the publication itself, the connectivity and continuity of individual authors’ thinking is something of a derived product. (Indeed, I’m tempted to think of scientists as secondary “processing” nodes in a greater flow in which the publications are the truly meaningful nodes.)
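To make the “derived product” point concrete, here’s a toy sketch of computing author-level links from publication-level citations. All the papers, authors and citation links are invented.

```python
# Sketch: author-level connectivity as a *derived* product of the
# publication citation graph. Papers, authors and citations invented.
from collections import Counter

papers = {
    "P1": {"authors": {"Alice"}, "cites": set()},
    "P2": {"authors": {"Bob"}, "cites": {"P1"}},
    "P3": {"authors": {"Alice", "Carol"}, "cites": {"P2"}},
}

author_links = Counter()
for meta in papers.values():
    for cited in meta["cites"]:
        for a in meta["authors"]:
            for b in papers[cited]["authors"]:
                if a != b:
                    author_links[frozenset((a, b))] += 1

# Each edge weight counts how often one author's work builds on another's
print(author_links[frozenset(("Alice", "Bob"))])  # 2
```

The publications are the primary nodes; the author network falls out of them for free.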
Gaps: Operationalized Semantic Analysis
Techniques in semantic analysis that are currently exploratory must begin to find their way into the mainstream. “Keyword frequency count” type searching is simply too rudimentary for the types of knowledge mining needed for scholarly publications. A key issue here is scientific (or indeed, any scholarly) jargon. Key words and phrases primarily have meaning only within a specific context, and typically hide a much deeper set of implied meanings and contexts. For automated tools to learn these contexts and connections, much more sophisticated approaches are needed.
Self discovered network maps could be one approach, but a potentially more lucrative tack would be to treat it as a generic problem of nonlinear dimensional reduction, i.e., collapsing a large number of words and phrases (a scientific journal article, or corpus of journal articles … i.e., a very high dimensional dataset) to a much smaller number of contextual dimensions; a “coordinate system for meaning”. (I believe there’s some overlap with the concept of ontologies here. I tend to think we’re likely to make most progress with “supervised self-discovered”, rather than community-developed, ontologies. Although I wouldn’t go quite so far as the metacrap diatribe level of cynicism.).
That’s pretty abstract, but consider the examples of self-discovered keyword organization by Roweis et al. using Locally Linear Embedding.
A low-dimensional “coordinate vector” for a set of keywords, concepts or themes would allow true “quantitative” proximity detection searching between publications or documents. Just for fun, here are a couple of other examples of LLE in action.
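Here’s a minimal sketch of such a coordinate system and proximity search. I’ve used TF-IDF plus truncated SVD (classic latent semantic analysis) as a simple linear stand-in for nonlinear methods like LLE, and the four-document corpus is invented.

```python
# Sketch: give each document a low-dimensional "coordinate vector",
# then do quantitative proximity searching in that space. TF-IDF plus
# truncated SVD stands in for fancier nonlinear methods like LLE.
import numpy as np

docs = [
    "cyclone pressure wind storm",                  # weather papers...
    "wind storm surge cyclone",
    "citation index journal publication metadata",  # ...and informatics papers
    "journal publication metadata index",
]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# tf-idf weighting, then SVD down to 2 latent "meaning" dimensions
idf = np.log(len(docs) / (X > 0).sum(axis=0))
U, s, _ = np.linalg.svd(X * idf, full_matrices=False)
coords = U[:, :2] * s[:2]      # one 2-D coordinate vector per document

def nearest(i):
    """Rank the other documents by distance to document i."""
    dists = np.linalg.norm(coords - coords[i], axis=1)
    return [j for j in np.argsort(dists) if j != i]

print(nearest(0))  # the other weather paper ranks first
```

The point isn’t this particular algorithm; it’s that once documents live in a shared “coordinate system for meaning”, proximity queries become simple arithmetic.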
Something about developing quantitative coordinate systems for complex meanings presses all my “rife with opportunity” buttons. I’d strongly recommend checking out the work of Saul & Roweis and of Seung & Lee, starting with “The Manifold Ways of Perception”.
Gaps: New Intellectual Property Tools
Assuming that the development of semantic analysis schemes is probably beyond the capabilities or resources of most professional societies or commercial scientific journal publishers, some other organization (whether public or private) will probably end up doing it. For this, we need copyright license schemes which allow access to full text content, but limited to the applications of indexing, semantic analysis and knowledge mining. Users of these licenses would be prohibited from redistributing anything other than semantic extracts of source documents. (Actually, perhaps new tools are not even needed. What I am proposing as an algorithm is precisely what occurs within every scientist’s head using the subscriptions and license permissions they already have. Somehow I suspect this nuance will be lost the first time a commercial vendor seeks knowledge-mining access to a for-profit professional society’s journal assets…)
If this issue ends up getting really thorny, one potential solution might be to involve some trusted and neutral third party, such as the U.S. Library of Congress, as the “knowledge mining broker”.
Gaps: Bringing it all together
The final step would be the marriage of semantic analysis and the “journal social network” provided by CrossRef to enable truly value-added scientific social networking. The strengths of network connections would be determined by semantic matching, thus allowing the very large reverse and forward networks to be pruned to the most relevant “threads” of thought, which is precisely the quantum leap needed to move beyond Google-type searching. This convergence of capabilities would, in theory, allow the examples at the start of this article to become the norm, rather than pipe dreams.
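One way that pruning might look, assuming each paper already has a semantic coordinate vector of the kind discussed above; the citation graph, vectors, and similarity threshold here are all invented for illustration.

```python
# Sketch: pruning a citation network down to the semantically closest
# "threads of thought". Graph, coordinates and threshold are invented.
import numpy as np

cited_by = {"A": ["B", "C", "D"]}   # papers that cite paper A
theme = {                           # low-dim semantic coordinate vectors
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.9, 0.1]),      # close intellectual follow-up
    "C": np.array([0.0, 1.0]),      # cites A only in passing
    "D": np.array([0.7, 0.3]),
}

def relevant_followups(paper, threshold=0.8):
    """Keep only citing papers whose cosine similarity clears the bar."""
    a = theme[paper]
    keep = []
    for other in cited_by[paper]:
        b = theme[other]
        sim = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim >= threshold:
            keep.append(other)
    return keep

print(relevant_followups("A"))  # B and D survive; C is pruned away
```

Raw forward-citation lists for a well-cited paper can run to hundreds of entries; semantic filtering of this sort is what would turn them into readable “threads”.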
Extensibility (or Bootstrappability)
As for piloting such capabilities, I suggested above that the AMS repository would make an excellent pilot database. However, because the problem is general (see below), there are many possible test datasets. In theory, smaller / private wiki databases could also be used to test a pilot, if author participation were preserved in the wiki system (I honestly don’t know enough about wiki under-the-hood capabilities to know if this could work; if it could, then perhaps partnership with Wikipedia itself would provide a launching platform to take such capabilities to national visibility).
My motivations here are largely centered on scientific research, but this is really just a special case of the problem of “getting knowledge management systems to start delivering some payoff”. Knowledge management systems have been on the corporate scene for quite some time, yet consistently languish when it comes to user satisfaction (see the Bain Management Tool Survey 2009, slide 15), and from my personal experience they rarely go much beyond the “common data repository” stage, at best. I suspect the immaturity of back-end knowledge extraction capability is the stumbling block: without it, it becomes difficult to justify the very large startup and maintenance costs of knowledge management efforts (not to mention the significant additional overhead on individual workers and contributors to populate and maintain the content). This article on knowledge management system challenges is a bit old, but has some very good perspectives.
Precisely because of this coupling to knowledge management and the business sector, I worry that the potential payoff of these semantic search capabilities is so high that their first operational emergence will be proprietary and patented, and that this in turn will price the capabilities beyond the reach of nonprofit or public sector research organizations. This would be very bad for our national “competitive edge”. In a world where our information (knowledge) assets are a key source of competitive advantage, a national (public sector) investment in our ability to access and utilize this knowledge seems about as fundamental and foundational an investment as there could possibly be. At the risk of sounding hyperbolic, I’d liken this to DARPA’s pioneering of the internet. That national investment helped lead to global capabilities, but because of our relative global position in intellectual capital, the U.S. economy was at the forefront of profiting from the infrastructure investment.