Creating a global knowledge network
Don't just clone the paper methodology
Paul Ginsparg, a
physicist at the Los Alamos National Laboratory, contributed this keynote paper
to the Freedom of Information Conference on the impact of open access on
biomedical research sponsored by the New York Academy of Medicine in July 2000.
He presents an interesting perspective on how electronic publishing options
will affect scholarly communication. Ginsparg developed the world's first
electronic pre-print archive, originally dedicated to papers in his own field
(high-energy theoretical physics). The archive was quickly extended to cover
other areas of physics, and even other disciplines. Today it regularly
processes between 1,000 and 2,000 electronic transactions per hour (see the Los
Alamos arXiv at:
http://lib-www.lanl.gov/lww/welcome.html).
How should our
scientific research communications infrastructure be reconfigured to take
maximal advantage of newly evolving electronic resources? Rather than
"electronic publishing," which connotes a rather straightforward
cloning of the paper methodology to the electronic network, many researchers
would prefer to see the new technology lead to some form of global
"knowledge network," and sooner rather than later.
Some of the
possibilities offered by a unified global archive are suggested by the Los
Alamos e-print archives (where "e-print" denotes self-archiving by
the author), which, since their inception in 1991, have become a major forum
for dissemination of results in physics and mathematics. These e-print archives
are entirely scientist driven, and are flexible enough either to co-exist with
the pre-existing publication system, or to help it evolve into something better
able to meet researcher needs. The archives are an example of a service created
by a group of specialists for their own use. It is also important to note that
the rapid dissemination they provide is not in the least inconsistent with
concurrent or subsequent peer review, and in the long run offers a possible
framework for a more functional archival structuring of the literature than is
provided by current peer review processes.
The electronic medium can do it cheaper and better
As argued by Odlyzko, [1] the current
methodology of research dissemination and validation is premised on a paper
medium that was difficult to produce, difficult to distribute, difficult to
archive, and difficult to duplicate, a medium that hence required numerous
local redistribution points in the form of research libraries. The electronic
medium is opposite in each of the above regards, and, hence, if we were to
start from scratch today to design a quality controlled distribution system for
research findings, it would likely take a very different form both from the
current system and from the electronic clone it would spawn without more
constructive input from the research community.
The need to reconsider the above methodology is
reinforced by noting that each article typically costs many tens of thousands
of dollars to produce in salaries, and much more in equipment and overhead. A
key point of the electronic communication medium is that, for a minuscule
additional fraction of this amount, it is possible to archive the article and
make it freely available to the entire world in perpetuity. Moreover, this is
consistent with public policy goals for what is in large part publicly funded
research. [3] The nine-year lesson so
far from the Los Alamos archives is that this additional cost, including the
cost of the global mirror network, can be as little as a dollar per article,
and there is no indication that maintenance of the archival portion of the
database will require an increasing fraction of the time, cost, or effort.
Odlyzko has also pointed out that average aggregate
publisher revenues are roughly $4,000 per article. [2]
Of course, some of the publisher revenues are
necessary to organize peer review, although the latter depends on the donated
time and energy of the research community and is subsidized by the same grant
funds and institutions that sponsor the research in the first place. The
question crystallized by the new communications medium is whether this
arrangement is the most efficient way to organize the review and certification
functions.
A new model for research communications

The figure above
illustrates one such possible hierarchical structuring of our research
communications infrastructure. It also represents graphically the key
possibility in the new electronic architecture: that of disentangling and
decoupling the production and dissemination on the one hand, from the quality
control and validation on the other, in a way that is not possible in the paper
realm. The figure shows three electronic service layers, as viewed by the
interested reader/researcher, who can choose the most auspicious access method
for navigating the electronic literature. The three layers are the data,
information, and knowledge networks, where "information" is taken to mean
data plus metadata (i.e., descriptive data), and "knowledge" signifies
information plus synthesis (i.e., additional synthesizing information).
The knowledge layer
includes third parties that can overlay the information and data levels with
synthesizing information, and can partition the information into sectors
according to subject area, overall importance, quality of research, degree of
pedagogy, interdisciplinarity, or other useful criteria. They can also maintain
other useful retrospective resources, such as suggesting a minimal path through
the literature to understand a given article, and suggesting pointers to
outstanding lines of research later spawned by it.
The three layers
depicted are multiply interconnected. The information layer can harvest and
index metadata from the data layer to generate an aggregation which can in turn
span more than one particular archive or discipline. The knowledge layer points
to useful resources in the information layer. The synthesizing information in
the knowledge layer is the glue that assembles the building blocks from the
lower layers into a knowledge structure more accessible to both experts and
non-experts.
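The harvesting step described above can be sketched in a few lines. This is a minimal illustration, not the arXiv's or any real protocol's implementation: the `Record` type, the `harvest` function, the archive names, and the identifiers are all hypothetical, chosen only to show how an information-layer service might aggregate metadata across more than one data-layer archive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Metadata describing one full-text item held in a data-layer archive."""
    identifier: str
    title: str
    subject: str


def harvest(archives):
    """Merge metadata from several archives into one subject index.

    `archives` maps an archive name to its list of Record objects.
    Returns a dict mapping each subject to a sorted list of identifiers,
    so a single subject entry can span more than one archive.
    """
    index = {}
    for records in archives.values():
        for record in records:
            index.setdefault(record.subject, []).append(record.identifier)
    return {subject: sorted(ids) for subject, ids in index.items()}


# Hypothetical holdings of two data-layer archives (illustrative only).
ARCHIVES = {
    "archive_a": [
        Record("a:0001", "Example paper one", "physics"),
        Record("a:0002", "Example paper two", "mathematics"),
    ],
    "archive_b": [
        Record("b:0001", "Example paper three", "physics"),
    ],
}

SUBJECT_INDEX = harvest(ARCHIVES)
```

In this sketch the index stores only identifiers, not full text: the information layer describes and points to the data layer rather than copying it.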
The role of journals in
this new hierarchy is to serve as pointers to selected entries at the data
level. This is identical to the current primary role of journals: to select and
certify specific subsets of the literature for the benefit of the reader. A
heterodox point that arises in this model is that a given article at the data
level can be pointed to by multiple such virtual journals, insofar as they are
trying to provide a useful guide to the reader. Such multiple appearance would
no longer waste space on library shelves, nor be viewed as dishonest. This
could tend to reduce the overall article flux and any tendency on the part of
authors towards creating "least publishable units." The author of the
future could thereby be promoted on the basis of quality rather than quantity:
instead of 25 articles on a given subject, the author can point to a single
critical article that appears in 25 different journals.
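The idea of a journal as a set of pointers can be made concrete with a small sketch. Everything named here is an assumption for illustration (the journal names, the identifiers, and both helper functions); the point is only that one article stored once at the data level can be selected by several overlay journals without being duplicated.

```python
# Hypothetical identifiers of articles stored once at the data level.
DATA_LAYER = {"a:0001", "a:0002", "b:0001"}

# Hypothetical overlay journals: each is just a set of pointers (identifiers)
# into the data layer, not a separate copy of the articles.
OVERLAY_JOURNALS = {
    "journal_x": {"a:0001", "b:0001"},
    "journal_y": {"a:0001"},
    "journal_z": {"a:0001", "a:0002"},
}


def journals_pointing_to(article_id, journals):
    """Return the sorted names of overlay journals that select article_id."""
    return sorted(
        name for name, selected in journals.items() if article_id in selected
    )


def dangling_pointers(journals, data_layer):
    """Return, per journal, any pointers that match nothing in the data layer."""
    missing = {}
    for name, selected in journals.items():
        unresolved = selected - data_layer
        if unresolved:
            missing[name] = unresolved
    return missing
```

Here `journals_pointing_to("a:0001", OVERLAY_JOURNALS)` lists three journals for a single stored article, which is the "one article, many journals" situation the text describes.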
The reader can choose
how best to proceed for any given application: either trolling for gems
directly from the data level (as many graduate students are occasionally wont
to do, hoping to find a key insight missed by the mainstream), or instead
beginning the quest at the information or knowledge levels, in order to benefit
from some form of prefiltering or organization. The reader most in need of a
structured guide would turn directly to the highest level of value-added
knowledge in the knowledge network.
This is where capitalism
should return to the fore: researchers can and should be willing to pay a fair
market value for services provided at the information or knowledge levels that
facilitate and enhance the research experience. However, for reasons detailed
above, we expect that access at the raw data level can be provided without
charge to readers. In the future this raw access can be further assisted not
only by full text search engines, but also by automatically generated reference
and citation linking. The experience from the physics e-print archives is that
this raw access is extremely useful to researchers, and the small admixture of
noise from a non-peer reviewed sector has not constituted a major problem.
Research in science has certain well defined checks and balances, and is
ordinarily pursued by certain well defined communities.
Change will come through experiment and evolutionary forces
Ultimately, issues regarding the correct configuration of electronic research
infrastructure will be decided experimentally, and it will be edifying to watch
the evolving roles of the current participants. It is also useful to bear in
mind that much of the entrenched current methodology is largely a post-World
War II construct, including both the large scale entry of commercial publishers
and the widespread use of peer review for mass implementation of quality
control (neither necessary to, nor a guarantee of, good science). Ironically,
the new technology may allow the traditional players from a century ago, namely
the professional societies and institutional libraries, to return to their
dominant role in support of the research enterprise.
The original objective of the Los Alamos archives was to
provide functionality that was not otherwise available, and to provide a level
playing field for researchers at different academic levels and different
geographic locations; the dramatic reduction in cost of dissemination came
as an unexpected bonus. As Andy Grove of Intel has pointed out, [4]
when a critical business element is changed by a factor of 10, it is necessary
to rethink the entire enterprise. The Los Alamos e-print archives suggest that
dissemination costs can be lowered by more than two orders of magnitude, not
just one.
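As a rough check on the scale involved, the two per-article figures quoted earlier in this article (roughly $4,000 in publisher revenue and as little as a dollar in archive cost) can be compared directly. The sketch below is only back-of-the-envelope arithmetic: the variable and function names are hypothetical, and because publisher revenue also pays for peer review and other functions, the ratio overstates the saving on dissemination alone.

```python
import math

# Figures quoted in the article; both are approximate.
PUBLISHER_REVENUE_PER_ARTICLE = 4000.0  # dollars, attributed to Odlyzko
ARCHIVE_COST_PER_ARTICLE = 1.0          # dollars, "as little as a dollar"


def orders_of_magnitude(before, after):
    """Return how many powers of ten separate two positive costs."""
    return math.log10(before / after)


RATIO = PUBLISHER_REVENUE_PER_ARTICLE / ARCHIVE_COST_PER_ARTICLE
ORDERS = orders_of_magnitude(
    PUBLISHER_REVENUE_PER_ARTICLE, ARCHIVE_COST_PER_ARTICLE
)
```

On these numbers the gap is a little over three and a half orders of magnitude, so even if only a modest share of publisher revenue is attributed to dissemination, the "more than two orders of magnitude" statement remains a conservative reading.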
In the next 10 to 20
years, it is likely that many research communities will move to some form of
global unified archive system, without the current partitioning and access
restrictions familiar from the paper medium, for the simple reason that it is
the best way to communicate knowledge and hence to create new knowledge.
Data level: the figure
shows a small number of potentially representative providers, including the Los
Alamos e-print arXiv (and implicitly its international mirror network), a
university library system such as the California Digital Library (CDL), and a
typical foreign funding agency, such as the French Centre National de la
Recherche Scientifique (CNRS). These are intended to convey the likely
importance of library and international components. Note that there already
exist cooperative agreements with each of these to coordinate via the
"open archives" protocols (http://www.openarchives.org/) to
facilitate aggregate distributed collections.
Information level: the
figure shows a generic public search engine (Google), a generic commercial
indexer (Institute for Scientific Information, ISI), and a generic government
resource (the PubScience initiative), suggesting a mixture of free, commercial,
and publicly funded resources at this level. For the biomedical audience at
hand, I might have included services like Chemical Abstracts and PubMed at this
level. A service such as GenBank is a hybrid in this setting, with components
at both the data and information layers. The proposed role of PubMedCentral
would be to fill the electronic gaps in the data layer highlighted by the more
complete PubMed metadata.
Knowledge level: the
figure shows a tiny set of existing physics publishers: American Physical
Society (APS), Journal of High Energy Physics (JHEP), and Advances in
Theoretical and Mathematical Physics (ATMP); the second is based in Italy and
the third already uses the arXiv entirely for its electronic dissemination. It
also shows BioMed Central (BMC). These are the third parties that can overlay
additional synthesizing information on top of the information and data levels;
partition the information into sectors according to subject area, overall
importance, quality of research, degree of pedagogy, interdisciplinarity, or
other useful criteria; and maintain other useful retrospective resources, such as
suggesting a minimal path through the literature to understand a given article,
and suggesting pointers to outstanding lines of research later spawned by it.
The synthesizing information in the knowledge layer is the glue that assembles
the building blocks from the lower layers into a knowledge structure more
accessible to both experts and non-experts.
The three layers
depicted are multiply interconnected. The [arrows from the middle
info section pointing to the boxes below] indicate that the
information layer can harvest and index metadata from the data layer to
generate an aggregation, which can in turn span more than one particular
archive or discipline. The [arrows from the knowledge line pointing
to the middle info boxes] suggest that the knowledge layer points
to useful resources in the information layer. The [long arrows pointing from
the knowledge line to the first box of the data
line], which are critical here, represent how journals of the future can exist in
an "overlay" form, i.e., as a set of pointers to selected entries at
the data level. The [arrows coming from the reader's eye] suggest how the
reader might best proceed for any given application: either trolling for gems
directly from the data level (as many graduate students are occasionally wont
to do, hoping to find a key insight missed by the mainstream), or instead
beginning the quest at the information or knowledge levels, in order to benefit
from some form of pre-filtering or organization.
References
1. Odlyzko, A. Tragic loss or good riddance? The impending demise of
traditional scholarly journals. International Journal of Human-Computer
Studies 1995; 42:71-122. Also available in the electronic Journal
of Universal Computer Science pilot issue, 1994.
2. Odlyzko, A. Competition and cooperation: libraries and publishers in the
transition to electronic scholarly journals. Journal of Electronic
Publishing 1999.
3. Bachrach, S. et al. Who Should Own Scientific Papers? Science 1998;
281:1459-60.
4. Grove, A. Only the Paranoid Survive: How to Exploit the Crisis Points
That Challenge Every Company and Career. Bantam Doubleday Dell, 1996.