
October 31, 2007

WHAT GETS VIEWED? An exploratory study of large IR collections

In my work circle there has been a lot of talk about growing our institutional repository. There is a big push to add meaningful content. The thing that I always get hung up on though is usage. I’m very interested in what people find useful, and my feeling is that if I’m going to pitch this service to my faculty, then I need to prove to them that the stuff is actually being seen, rather than simply offering them a theoretical argument about why open access is good and big publishers are evil.

So I decided to do a mini study. I wanted to see what the top items viewed were across several universities. I used ROAR to identify DSpace collections in the US, and then sent emails to the libraries with the 10 largest collections. One library never responded, and another (MIT) shot me down with “I'm sorry to report that our staff is unable to provide that data at this time,” but all the others provided me with a list of their top 20 most viewed items. (Thanks!)
I should note that Georgia Tech and U of Oregon were the only organizations in this sample that allowed open access to their statistics.

The results were very eclectic, as expected, but definite themes emerged. For example, the U of Rochester included many musical scores, U of Michigan was heavy with engineering technical reports, Ohio State had numerous articles from The Ohio Journal of Science, U Oregon featured NewBreed Librarian articles as well as classic texts from Shakespeare, Milton and others, while Oregon State included several environmental topics. U Maryland had the most diverse materials and is unquestionably the most heavily used collection within this sample.

Someone should publish a scholarly article about this and perform a detailed synthesis of these collections, but in the meantime, here are the top viewed items from each of the collections:

  • Delivery of DNA and Recombinant Infectious Bursal Disease Virus Vaccines in Ovo (dissertation), 34,768 hits, U of Maryland
  • How Do I Do This in ArcGIS/Manifold?: Illustrating Classic GIS Tasks, 18,636 hits, Cornell
  • Relaxation studies in the muscular discriminations required for touch, agility and expression in pianoforte playing, 8,764 hits, U of Rochester
  • A study of the role of carbon in temper-embrittlement and the effect of temper-embrittlement on the fatigue properties of a 3140 steel, 7,155 hits, U of Michigan
  • Dragonflies Taken in a Week, 6,650 hits, Ohio State
  • Measurement of delignification diversity within kraft pulping (dissertation), 5,517 hits, Georgia Tech (current year only)
  • NewBreed Librarian ; Vol. 2, No. 4, 2,093 hits, U of Oregon
  • Estimating the weight of plywood, 500 hits, Oregon State

There is definitely a lot of long-tail action going on too. Most of the repositories featured one or two heavily used items, but then dropped off drastically.

[Figure: long-tail chart for a sample of the U of Maryland IR]

Some questions:

  • Why is the U of Maryland IR used so heavily? Their top 3 items blow away everyone else (34,768 hits; 32,916 hits; and 32,214 hits respectively)
  • How are people finding this stuff? Google? Native Searches? Catalog Searches? Direct Links? We need to run an analytics program.
  • How many of these hits are from web crawlers or related software?
  • Why the long-tail? What makes those top few items so popular? And just how long is the tail? Could you say something like 90% of everything in our IR was viewed at least once over the past two years?
  • If you place your IR within your metasearch tool, will it pad your results?
  • Is there a big difference between views and downloads?
  • Why does the DSpace interface still look so mid-1990s?
  • How are items obtained? Is it piecemeal or more systematic? Are we building collections or is it random take-what-we-can-get?
  • What is the percentage of dissertations? (or, take away dissertations and what have you got left?)
  • What non-text items are collected (mp3, videos, jpg, etc)?
  • Leaving the big vision rhetoric aside, what is the goal of each IR?
  • How do you measure the success of an IR? Is it volume or downloads or something else?
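A few of these questions (crawler hits, the long tail, and the "viewed at least once" figure) could be explored with a small script once a repository's access log is in hand. The following is only a rough sketch under stated assumptions: the log format (pairs of item ID and user-agent string), the crawler markers, and the sample records are all made up for illustration, and none of it reflects how DSpace or any of the libraries above actually count views.

```python
from collections import Counter

# Hypothetical substrings for spotting crawlers. This list is an assumption
# for illustration; real crawler filtering needs far more care than this.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_probable_bot(user_agent):
    """Return True if the user-agent string contains an obvious crawler marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def view_counts(log_records):
    """Count views per item, skipping hits that look like crawlers."""
    counts = Counter()
    for item_id, user_agent in log_records:
        if not is_probable_bot(user_agent):
            counts[item_id] += 1
    return counts


def share_viewed(counts, all_item_ids):
    """Fraction of all items in the repository with at least one counted view."""
    all_items = set(all_item_ids)
    if not all_items:
        return 0.0
    viewed = sum(1 for item in all_items if counts.get(item, 0) > 0)
    return viewed / len(all_items)


def top_n_share(counts, n):
    """Fraction of all counted views that went to the n most viewed items."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(c for _, c in counts.most_common(n))
    return top / total


# Made-up sample data, not taken from any real repository.
sample_log = [
    ("item-a", "Mozilla/5.0"),
    ("item-a", "Mozilla/5.0"),
    ("item-a", "Googlebot/2.1"),
    ("item-b", "Mozilla/5.0"),
    ("item-c", "Yahoo! Slurp"),
]
sample_items = ["item-a", "item-b", "item-c", "item-d"]

counts = view_counts(sample_log)
print("Share of items viewed at least once:", share_viewed(counts, sample_items))
print("Share of views going to the top item:", top_n_share(counts, 1))
```

Running the same calculation with and without the crawler filter would give a first impression of how much of the tail is real human use.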

(If this is your area and you want to work on something together, let me know. I'm devoted to ALA Editions right now, but I'd like to continue this project into 2008.)

Comments

Your questions about the full-text downloads are important and timely. We here at bepress announced last month that we had just completed an important project to improve the way Digital Commons counts downloads: http://www.bepress.com/download_counts.html.
Today I read that Southampton is also looking at tackling the problem: http://trac.eprints.org/projects/irstats/wiki

We found that the numbers for full-text downloads of open access material are inflated. By the end of 2007, bepress predicts that, without filtering, one out of every two logged downloads from academic sites will be made by machine or mistake. We also found that non-human downloads are far from uniformly or proportionately distributed across papers. For example, one paper moved from over 6000 hits (with our previous good filtering) to 6 hits (with our new, we hope excellent, filtering). I would suggest that this might explain many of the anomalies you highlighted in your blog.

I wonder why you explored only DSpace Institutional Repositories (IRs)? There are 46 DSpace IRs registered in ROAR for the US, but there are also 34 US EPrints IRs and 39 US BePress IRs. In the UK, the figures are 18 DSpace IRs, 37 EPrints and 2 BePress. Worldwide, the figures are 176 DSpace, 145 EPrints and 49 BePress. A less DSpace-centred study might even reveal some instructive differences in functionality, policy and success rate...

Stevan Harnad
American Scientist Open Access Forum

Thanks for your response. It came down to wanting to compare apples to apples. My school uses DSpace and I wanted to measure our top items against our peers. The study was very exploratory; I was actually less interested in the number of hits and more intrigued by the content. Future investigation could certainly consider other platforms, but for the sake of consistency I just wanted to keep it simple.

Most IR people I have spoken to say that the vast majority of their hits are directed to the IR by major search engines, such as Google.

At UR, we changed the DSpace interface so that the download counts for every item and collection are displayed in the interface. We wrote our own download counter program that allows us to ensure that we aren't counting web crawler hits.


Our long tail is very long. There is nothing in our repository that hasn't been downloaded. I can say with almost absolute certainty that everything in our repository has been downloaded at least once each year.

I'd be glad to visit these issues with you in 2008.
