Users of social networking services can connect with each other by forming communities for online interaction. Yet as the number of communities hosted by such websites grows over time, users have an even greater need for effective community recommendations in order to meet more users. In this paper, we investigate two algorithms from very different domains and evaluate their effectiveness for personalized community recommendation. The first is association rule mining (ARM), which discovers associations between sets of communities that are shared across many users. The second is latent Dirichlet allocation (LDA), which models user-community co-occurrences using latent aspects. In comparing LDA with ARM, we are interested in discovering whether modeling low-rank latent structure is more effective for recommendations than directly mining rules from the observed data. We experiment on an Orkut data set consisting of 492,104 users and 118,002 communities. We show that LDA consistently performs better than ARM under the top-k recommendation ranking metric, and we analyze examples of the latent information learned by LDA to explain this finding. To efficiently handle the large-scale data set, we parallelize LDA on distributed computers and demonstrate our parallel implementation's scalability with varying numbers of machines.
Collaborative Filtering for Orkut Communities: Discovery of User Latent Behavior
Wen-Yen Chen, Jon Chu, Junyi Luan, Hongjie Bai, Yi Wang, and Edward Y. Chang
International World Wide Web Conference (WWW)
Madrid, Spain, April 2009 (11% accepted).
PLDA: Parallel Latent Dirichlet Allocation
Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang
International Conference on Algorithmic Aspects in Information and Management (AAIM)
San Francisco, CA, June 2009.
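As a rough illustration of the ARM side of the comparison above, the following sketch mines rules between pairs of communities from user memberships and scores recommendations by rule confidence. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names, the restriction to single-community rules, the `min_support` threshold, the confidence-sum scoring, and the toy membership data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(memberships, min_support=2):
    """Mine rules "a -> b" between single communities (hypothetical helper).

    memberships maps a user id to the list of communities the user joined.
    Returns {antecedent: [(consequent, confidence), ...]}.
    """
    single = Counter()  # number of users in each community
    pair = Counter()    # number of users sharing each pair of communities
    for communities in memberships.values():
        unique = sorted(set(communities))
        single.update(unique)
        pair.update(combinations(unique, 2))

    rules = {}
    for (a, b), n_ab in pair.items():
        if n_ab < min_support:
            continue  # drop pairs shared by too few users
        # confidence(a -> b) = support(a, b) / support(a)
        rules.setdefault(a, []).append((b, n_ab / single[a]))
        rules.setdefault(b, []).append((a, n_ab / single[b]))
    return rules


def recommend(user_communities, rules, k=3):
    """Rank communities the user has not joined by summed rule confidence."""
    joined = set(user_communities)
    scores = Counter()
    for a in joined:
        for b, confidence in rules.get(a, []):
            if b not in joined:
                scores[b] += confidence
    return [community for community, _ in scores.most_common(k)]


# Toy data, invented for illustration only.
memberships = {
    "u1": ["chess", "go", "jazz"],
    "u2": ["chess", "go"],
    "u3": ["chess", "jazz"],
    "u4": ["go", "poker"],
}
rules = mine_pair_rules(memberships, min_support=2)
print(recommend(["jazz"], rules))
```

A latent-aspect model such as LDA would instead score a community through the topics it shares with the user, so it can recommend communities that never co-occur directly with the user's own.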
Rapid growth in the amount of data available on social-network sites has made information retrieval increasingly challenging for users. We propose Combinational Collaborative Filtering (CCF) to perform personalized community recommendations by considering multiple types of co-occurrences in social data at the same time. This filtering method fuses semantic and user information, then applies a hybrid training strategy that combines Gibbs sampling and the Expectation-Maximization algorithm. To handle the large-scale data set, parallel computing is used to speed up the model training. Through an empirical study on the Orkut data set, we show CCF to be both effective and scalable.
Combinational Collaborative Filtering for Personalized Community Recommendation
Wen-Yen Chen, Dong Zhang, and Edward Y. Chang
ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining (KDD)
Las Vegas, NV, August 2008 (10% accepted).
Spectral clustering has been shown to be more effective in finding clusters than some traditional algorithms. However, it suffers from a scalability problem in both memory use and computational time when the data set is large. To perform clustering on large data sets, we investigate various ways of approximating the dense similarity matrix. We compare sparsifying the matrix with the Nyström method. We then pick the strategy of sparsifying the matrix by retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a large document data set of 193,844 instances and a large photo data set of 2,121,863 instances, we demonstrate that our parallel algorithm can effectively alleviate the scalability problem. A short version appears at ECML/PKDD 2008.
Parallel Spectral Clustering in Distributed Systems
Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang
Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2010
Parallel Spectral Clustering
Yangqiu Song, Wen-Yen Chen, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang
European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD)
Antwerp, Belgium, September 2008 (18% accepted).
Also appears in Lecture Notes in Artificial Intelligence (LNAI), Vol. 5212, pp. 374-389, 2008.
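The sparsification strategy described in the abstract above, retaining only each point's nearest neighbors in the similarity matrix, can be sketched as follows. This is an illustrative sketch, not the authors' parallel implementation: the Gaussian similarity, the `sigma` parameter, the rule of keeping an entry when either point lists the other as a neighbor, and the dictionary-based sparse format are all assumptions, and the function name is hypothetical. It also builds the dense matrix first, which a scalable version would avoid.

```python
import math


def knn_sparsify(points, t, sigma=1.0):
    """Keep only the t most similar neighbors of each point (hypothetical helper).

    points is a list of equal-length numeric tuples.
    Returns a sparse similarity matrix as {(i, j): similarity}.
    """
    n = len(points)

    # Dense Gaussian similarity, for illustration only: O(n^2) memory.
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            sim[i][j] = sim[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))

    # Retain entry (i, j) if j is among i's t nearest neighbors or vice
    # versa, so the sparse matrix stays symmetric.
    keep = set()
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sim[i][j], reverse=True)
        for j in others[:t]:
            keep.add((i, j))
            keep.add((j, i))

    return {(i, j): sim[i][j] for (i, j) in keep}


# Toy data, invented for illustration: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
sparse = knn_sparsify(points, t=1)
print(sorted(sparse))
```

With t much smaller than n, the number of stored entries grows roughly as n times t rather than n squared, which is what makes the memory use manageable on large data sets.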
Fotofiti is a research platform for automating semantic annotation of digital photographs. It not only provides a web-based user interface for managing social networks, events, and photographs, but also makes good use of a variety of metadata geared towards image classification and similarity assessment. Fotofiti features real-time online semantic annotation using global features from both content and context. A manual annotation web interface was created to provide training examples for our classifier. Classification experiments using various learning techniques were performed on a real-world data set. Additionally, a scalable landmark recognition system that uses local features is discussed.
Fotowiki is a wiki-based map service that integrates visual and textual information with maps. Fotowiki divides a geographical area into sub-areas. An individual responsible for providing information about a sub-area enters collected data into a wiki page. Fotowiki uploads the distributed wiki pages and overlays the information on the map. In addition to the traditional aerial images provided by Google Maps, Fotowiki adds both street-level views of the surrounding area and a 360-degree panorama tour of a spot to the map. More importantly, the fine-grained information about a particular location, provided by its incentive information collection strategy, can substantially improve the usefulness of the information.