Wednesday, May 28, 2008
Why Write a Blog? (Posted by Paul Heymann)
Blogging is Huge
About a year ago now, I started working on a paper called "Can Social Bookmarking Improve Web Search?" Interestingly, that paper ended up being more about the nature of social bookmarking data (URLs which have been annotated with keyword "tags" by users) than it was really about web search itself. (I suppose that makes sense, given that we are the former "database group" and fascinated by data or information in a wide variety of contexts.) Specifically, what we really ask is:
- Is the data produced by social bookmarking systems different enough from other data that search engines have access to that it really constitutes "new information?"
- Are social bookmarking systems producing enough data to make a difference? On the scale of the web?
del.icio.us, the social bookmarking site I analyzed, gets over 100,000 posts on an average day. (See, for example, deli.ckoma, which has daily information about the number of posts to del.icio.us, going back several years.) But in the course of my analysis, I needed something to compare to that number.
Are 100,000 URLs a day, each with a few tags (keyword annotations), a large data source or a small one? And compared to what? The most natural comparison I could find was the blogosphere.
Blog posts seem to:
- Usually have at least one link to other, related, outside material (i.e., point to new and interesting URLs).
- Usually have some discussion of that outside material (i.e., "annotate URLs").
- Usually get written by end users who might not be building large scale websites (i.e., are "user-generated content").
When I started looking into numbers for the growth of blogs and the current quantity of blog posts, I was surprised. Blogging is about an order of magnitude bigger than social bookmarking (at least, for now), despite blog posts usually being more detailed and requiring more end user effort. Sifry, for example, puts the number of blog posts at around 1.4 million per day.
Blogging is one of the most massive and dynamic phenomena on the web today.
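As a rough check on the "order of magnitude" comparison, here is a back-of-the-envelope sketch using only the two figures quoted above (over 100,000 del.icio.us posts per day, and Sifry's estimate of about 1.4 million blog posts per day). The variable names are just for illustration, and both inputs are approximate.

```python
import math

# Approximate daily volumes quoted in this post.
delicious_posts_per_day = 100_000   # "over 100,000 posts on an average day"
blog_posts_per_day = 1_400_000      # Sifry's estimate

# How many times larger is blogging, and how many powers of ten is that?
ratio = blog_posts_per_day / delicious_posts_per_day
orders_of_magnitude = math.log10(ratio)

print(f"Blogs produce about {ratio:.0f}x as many posts per day.")
print(f"That is roughly {orders_of_magnitude:.2f} orders of magnitude.")
```

A ratio of about 14 is a little over one power of ten, which is what "about an order of magnitude" is meant to convey.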
Blogging is Structured
Database researchers have been fascinated by the web for a long time. In 1998, a group of the top researchers in databases got together to try to outline the research challenges for the next decade. What they produced was the Asilomar Report on Database Research, which, among other things, concludes that the grand challenge for database research for the next ten years should be:
"The Information Utility: Make it easy for everyone to store, organize, access, and analyze the majority of human information online."
However, database researchers tend to like schema and structure in data, something which has been pretty uncommon on the web until recently. There are some new developments which might give the web more structure, for example, Microformats or the Semantic Web. But it seems like we are going to be stuck with our current web for a while yet. And for now, the most structured data is coming from things like blogs with posts, RDF, RSS, Atom, Pingbacks, Trackbacks, and a variety of other structured output and interactions.
Blogging may be the web's best hope for structured, machine-readable data.
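To make "structured" a bit more concrete, here is a minimal sketch of pulling fields out of a blog feed with Python's standard library. The feed below is a made-up RSS 2.0 document (the blog.example.com URLs, the timestamp, and the categories are all invented for illustration), and parse_items is a hypothetical helper, not part of any real system described here.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed, for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://blog.example.com/</link>
    <item>
      <title>Why Write a Blog?</title>
      <link>http://blog.example.com/why-write-a-blog</link>
      <pubDate>Wed, 28 May 2008 00:00:00 GMT</pubDate>
      <category>blogging</category>
      <category>research</category>
    </item>
  </channel>
</rss>
"""


def parse_items(feed_xml):
    """Return one dict per <item> in an RSS 2.0 feed string."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
            # An item may carry zero or more <category> elements.
            "categories": [c.text for c in item.findall("category")],
        })
    return items


posts = parse_items(SAMPLE_FEED)
for post in posts:
    print(post["title"], post["link"], post["categories"])
```

The point is only that each post arrives with named fields (title, link, date, categories), so no page scraping or guessing is needed to recover them.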
The Web is Becoming Key To Disseminating Research
Researchers have a responsibility to disseminate their most interesting results to their community, and often to the public at large. Over the past decade, that has become increasingly easy. Specifically, the web has made it possible (and even simple) for researchers to make available research results which would only have been available to a small subset of academics and industry researchers a decade ago.
This has led to conflicts like:
- Should research be Open Access?
- How can we keep double blind peer review while still making research results available on the web in a timely manner?
- Should journals exist in an era when publishing can be so easily done on the web? (The arXiv is a powerful example of quality work that is published on the web without peer review.)
Years ago, the InfoLab did something unusual at the time. We started putting our publications up on the web, with structured data describing them, at our DBPubs publication server.
Now we think it is the right time to join the growing movement of researchers who use blogs to publicize and join a conversation about their work. Some of those people in Computer Science include Scott Aaronson, Hal Daume III, Greg Linden, and John Riedl.
We hope that the eclectic mix of research at the InfoLab will lead to an interesting and useful InfoBlog, for you, our readers, and that you will join us in this conversation.