InfoLab Logo Header

« Home | Next » | Next » | Next » | Next » | Next » | Next » | Next » | Next » | Next » | Next »

SpotSigs: Are Stopwords Finally Good for Something? (Posted by Martin Theobald)

In almost all classical Information Retrieval settings that have a text processing component, stopwords are first discarded before anything interesting happens with the document. “Interesting” here might mean indexing the content for search, extracting features for automatic classification, or some other form of content analysis of whatever flavor. Jonathan (my co-author on the SpotSigs paper) had the amazing idea that stopwords may however be very good indicators of the actual interesting parts of a web page. It is especially useful to know where the interesting parts of a web page are when they are interspersed with “added-value” content such as advertisements or navigational banners. This is most strikingly the case with online news articles, but applies more generally across the web.

In our SpotSigs project, we tried to detect near-duplicate Web pages in the news domain. This is a particularly challenging setting because the page layouts are often literally drenched with ads or navigational banners as added by the different sites. The actual core articles constitute only a minor fraction of the overall page, which makes online news a very hard setting for any unsupervised clustering or deduplication approach. Moreover, near duplicates of the same core article frequently pop up from different news sites as most of the content gets delivered by the same sources, such as Associated Press, and the very same core articles then often end up completely unchanged (or only after some minor editing) on many of the sites.

In response to this setting, the idea of extracting more “localized” signatures—namely those that are close to stopword occurrences (hence spots)—was born. These localized signatures exploit the observation that stopwords are frequently and uniformly distributed throughout any form of natural-language text—at least in Western languages—but they remain very infrequent in the typical headline-style banners or ads. SpotSigs connects such stopword anchors (called antecedents) with a nearby n-gram, which is simply a concatenation of further text tokens that are themselves not a stopword, in a very similar way to classic Shingling on the entire page content. In choosing only those Shingles that are connected to a stopword antecedent, however, SpotSigs tends to extract more robust signatures than plain Shingling. At the same time it allows for a more efficient and less error-prone signature extraction as compared to many of the far more sophisticated tools for HTML layout analysis. Another nice property is how SpotSigs handles synthetic documents that do not exhibit any natural language text, like 404 documents that crawlers typically encounter. SpotSigs automatically discards such synthetic documents.

As a second focus of the paper, SpotSigs also addresses some algorithmic challenges by tackling the inherent quadratic complexity when having to consider all candidate pairs of documents for the deduplication step. Here, our entire clustering algorithm is developed on the simple idea that we may never need to compare documents (or their respective signature sets) if they already substantially vary in length – at least if some set-resemblance-based similarity measure such as Jaccard similarity is used. This basic observation lets us very efficiently match high-dimensional signature vectors using a combination of classic techniques such as collection partitioning and inverted index pruning. As a rather surprising result, we found that the SpotSigs matching algorithm may even outperform linear-time similarity hashing approaches like locality-sensitive hashing in runtime – at least for reasonably high similarity thresholds and if the distribution of document (or signature) lengths throughout the collection is not too skewed – which nicely advocates the old algorithmic paradigm that sorting may sometimes be favorable over hashing. Overall, for a collection of 1.6 million documents from the TREC WT10g benchmark, we achieve a remarkably fast runtime of only about 5 minutes for parsing and extracting signatures, and less than 15 seconds for the actual clustering step on top of our index structures, already on a single mid-range server machine.

What I personally like about SpotSigs is that all its parts – from the signature extraction, over to the partitioning and index pruning – seamlessly fit together like some seemingly random pieces of a puzzle that finally make a nice picture. For example, the partitioning approach is not only good for breaking down the overall quadratic runtime into many much smaller pieces, but it also helps to smooth the skew toward shorter documents that we typically find in web collections and to provide more balanced partition sizes (the slides provide a little more detail on this). Similar themes recur throughout the work, for example, the threshold-based pruning approach applied in no less than three variations throughout the entire algorithm – with all steps being based on the very same similarity bound we derive for the (weighted) Jaccard resemblance between two Spot Signature sets.

After the presentation at SIGIR 2008, we received some interesting comments about the inherently subjective nature of detecting near duplicates. This suggests that for future work, thinking about how to personalize the signature extraction step beyond just using stopword anchors would certainly be an intriguing direction. We'd also like to improve the clustering algorithm by further investigating disk-based index structures, possibly distributing the algorithm onto multiple machines, and extending our bounding approach for more similarity metrics such as the well-know Cosine measure, which is more commonly used in IR than for example Jaccard.

Have a look at the paper or slides for more details.



Labels: , , , , , , , , ,

  1. Anonymous Car Blog | September 7, 2009 8:52 PM |  

    I found your blog on google and read a few of your other posts. I just added you to my Google News Reader. Keep up the good work. Look forward to reading more from you in the future.

    Mengembalikan Jati Diri Bangsa | Kenali dan Kunjungi Objek Wisata di Pandeglang

  2. Anonymous blog seo | September 25, 2009 7:39 AM |  

    This is actually really interesting regarding your fact article here, This article is very informative.

    Kenali dan Kunjungi Objek Wisata di Pandeglang | morat marit | cah bagoes | oes tsetnoc

  3. Anonymous Smith | September 26, 2009 10:19 PM |  

    Nice post. I just stumbled upon your blog and wanted to say that I have really enjoyed reading your blog posts. Any way I'll be subscribing to your feed and I hope you post again soon.

    if you do not mind, please visit my article related to pandeglang district in Banten, Indonesia at Kenali dan Kunjungi Objek Wisata di Pandeglang and also related to a leadership at Mengembalikan Jati Diri Bangsa

    Oes Tsetnoc | Oes Tsetnoc

  4. Anonymous Jack | October 4, 2009 12:33 AM |  

    I found your blog on google and read a few of your other posts. I just added you to my Google News Reader. Keep up the good work. Look forward to reading more from you in the future.

    iklan baris gratis | jaringan iklan gratis baris | iklan baris gratis | pasang iklan baris gratis | submit iklan baris gratis | media pasang iklan gratis | promosi gratis iklan baris gratis | iklan baris gratis | pasang iklan baris gratis

  5. Blogger Mizwar Smith | October 12, 2009 8:45 PM |  

    Thanks ever so much, very useful article. Great information!

    Pendatang Baru Kenali dan Kunjungi Objek Wisata di Pandeglang, Kembali Optimasi Kenali dan Kunjungi Objek Wisata di Pandeglang, Kenali dan Kunjungi Objek Wisata di Pandeglang Persaingan Semakin Sengit, Kenali dan Kunjungi Objek Wisata di Pandeglang, Optimasi Spam, Bolehkah?, Kenali dan Kunjungi Objek Wisata di Pandeglang SERP Baru, Kenali dan Kunjungi Objek Wisata di Pandeglang Turun Naik, Google “Ngedance” Pada Kenali dan Kunjungi Objek Wisata di Pandeglang, Kenali dan Kunjungi Objek Wisata di Pandeglang Masuk Halaman Pertama, Mencari Backlink dari .edu dan .gov, Masihkah Perlu?, Pandeglang, Banten – Eksotisme Pantai Tanjung Lesung, Pandeglang, Banten – Taman Nasional Ujung Kulon, Kenali dan Kunjungi Objek Wisata di Pandeglang

  6. Anonymous Fred Halls | October 28, 2009 6:44 PM |  

    Great Article guys - i'll be sure to visit this blog again.
    christmas gifts

  7. Blogger bitheads | October 29, 2009 12:01 AM |  

    Nice story . It's kind of funny and entertaining.Thank you for  sharing. the best place kenali dan kunjungi objek wisata di pandeglang to your vacation, i promise that's

leave a response