Michael Wen's Research Projects

Michael Wen 溫泰皓During my M.S. I conducted several small to medium-sized research projects either alone or on a team. Most of them are projects assigned in classes but attention to detail and a passion for exploration being my second nature, I almost always extend them and make them work better. A research project is different from a regular programming project in several aspects:

1. A research project doesn't tell you exactly how to do something. It only tells you the goal. It is you or your team who should figure out how to approach it.

2. A research project should be driven toward solving a problem that hasn't been solved completely. It shouldn't be solving problems that have already been thoroughly explored and concretely solved. The following projects, as you can see, include uncertainty elements that are left with me or my team to solve. Most of the time, if not all of the time, the problem is not completely solved and our solution leaves a HUGE room for improvement. Therefore they are considered 'Research Projects' instead of 'Programming Projects' as those listed in My Works.

GoogleDoc: A Document Search Engine
In a distributed system course I took we decided to do a project with Google API. Our vision to create a document search engine so that the user can find a document on the Internet that has a high degree of similarity to the given document. Possible applications include checking for plagiarism in the academia, learning more about a big topic, and checking to see whether an idea is already taken. We use JICOS to do distributed and parallel computing so that computation of results of a query can be returned in the shortest amount of time possible.

One thing worth noting is that Google's rank is based on more than 100 factors, many of which are related to the popularity of the web page. Therefore the highest ranked web page does not even have to contain similarity the query document. The goal of a document search engine is that if there exists such a similar document on the Web, we should find it and it should be ranked at the very top. Therefore we came up with our own ranking mechanism that assigns each returned URL a score based on purely the similarity and relevancy. The user can choose to display the results according to Google rank, Our rank, or alphabetical order.

Also Google API imposes several constraints such as the number of queries one can make and so on. Since our underlying search system relies on Google API completely, we have no control over that. Our system is written in Java 100%. We use JSP technology so that our system can be made accessible to anyone who has an Internet-enabled browser. Here are a couple of snapshots of GoogleDoc's user interface and results' display:

Google Doc
Google Doc

As you can see Ice and Fire coexist in 'GoogleDoc', meaning that it is able to perform well under situations of both extremes.

WS-Security Framework: Is it really secure?
In this project we focus on answering the question: Is the WS-Security Framework adequate to handle every type of real-world scenario that requires rigorous security? The scenario we are looking at is a loan bidding scenario. The client logs on a loan bidding website and seeks bids for a loan of certain size and certain period. The website sends the request out to the banks it trusts and begins the negotiation process. Each bank has various departments internally and each of them has certain responsibilities. In the event where a loan cannot be fulfilled, a bank seeks bids from its 'partners' which are independent entities who are not trusted by the loan bidding website. Of course the details are much more complicated. This is just the general idea. As you can see there are many security requirements that must be fulfilled by the WS-Security Framework in order for this scenario to migrate to web service.

WS-Security is the foundation block that takes care of message-level security. Many extensions were drafted to handle different issues. They include WS-Trust, WS-Policy, WS-SecureConversation, WS-Federation, and so on. We focused on WS-Security, WS-Trust, WS-Policy, and WS-SecureConversation to see if they could fulfill the security requirements of our scenario. Here is what WS-Framework looks like (from BEA Systems):

WS Security Framework

The conclusion is that the WS-Security Framework is adequate in most situations but there are things its specifications do not address and there are things we don't like about them. This project quickly evolves into a 30-page report detailing what we've found :-)

Although this project is not implementation-related, there are tons of specifications we need to read and UNDERSTAND. It's inevitably difficult to understand them because they are drafted by dozens of experienced engineers from industrial leaders like Microsoft, IBM, BEA, and Verisign. Despite all that we managed to find several loopholes and potential scenarios that allow people to exploit those loopholes. Not bad for 3 weeks of work by only 3 people, huh?

GTM API for Berkeley DB
Berkeley DB is a well-known database management service provider. It is developed by Sleepycat Software and it has been quite successful. It contains implementations in a wide variety of platforms, including C, C++, and Java. In a class I took we were assigned the task of writing a GTM, or global transaction manager, so that the system can do distributed database management in the presence of failures.

Every distributed database is associated with an LTM, or local transaction manager, to manage the local database. Anyone working in the database industry knows that getting a database management system (DBMS) to work correctly without having to consider failures is extremely simple. Taking into account the possibility of failures, however, makes the system ultra-extremely (was trying to think of a word even more intense than extremely...) difficult to build. The point is that every distributed database must agree upon an action for an unresolved transaction; it either aborts or commits. A GTM or LTM could fail at any point in time (could get down to instruction-level failure).

So in building the system we need to consider every possible failure, or as many of them as possible. Very often switching two lines of code yields totally different results. With many years of experience in programming and attention to fine detail, we were able to build a successful system. We also wrote sample applications that could communicate interactively with the LTMs, which was used to test the failure-handling aspect of our system.

The following is a diagram describing the model of our system and the message types:

Global Transaction Manager Framework

Dynamic Dictionary Problem
In this research project we are supposed to solve the dynamic version of the famous Dictionary Problem. The dictionary problem is explained in detail by the paper 'Efficient Solutions to the Replicated Log and Dictionary Problems' by Gene T.J. Wuu and Arthur J. Bernstein. In essence you have many distributed systems sitting in different geographical areas and they are communicating through a network. Each system is capable of performing operations on an imaginary 'centralized' database, and the only requirement is that each system must hold consistent global view relative to one another.

Putting it in a simpler way, a system can add or delete a word from its 'dictionary' which is stored in its local storage. Once I do something I need to inform everybody else so that they can do the same. Each system is multi-threaded to enable maximum efficiency and concurrency. However the systems need to have concurrency control so that no system would have more than one thread enter its critical section simultaneously. Basically our job is to implement a 'quorum' algorithm, many of which have been proposed by many papers. We need to take care of deadlocks, a new node joining, an existing node leaving, and mutual exclusion. After extensive searching and evaluating, we found that Maekawa's A N^(1/2) Algorithm for Mutual Exclusion in Decentralized Systems is effective and relatively easy to implement. After spending some time we correctly implemented the whole system. A node can join or leave anytime it feels like without disrupting the updating of information at each node. How about a big hand for the brains :-)

Eye Detection Program
It's self-explanatory. The program is given an image and is supposed to tell you where eyes of people are located. As you can see finding eyes reliably is extremely easy for people to do but difficult for machines to do. I used Matlab because it has a very useful image toolbox that I can use to my advantage. Anyway my approach is broken down into two stages: find the face, and then find the eyes. To find the face, I designed two methods: one utilizing the high symmetry of a human face and the other using average face matching.

In the first method I scan the image for the region that has the best symmetry, and then perform some optimization to get better results. Also I apply a skin filter (2 types the user can choose from) to the face detection window to further shrink down the search space.

In the second method I scan the image for the region that has the best correlation with the given average face template. To find eyes I also implemented two ways: rule-based one and average eye-matching. The rule-based method simply means the program tries to find two clusters of dark pixels (eyes are dark...) with some constraints applied to them. It takes too long and is highly error-prone. The average eye-matching, however, is extremely fast and gives accurate results, but user must supply the approximate ratio of the face size to the image size. This requirement renders this method less practical because ideally, user shouldn't tell the program anything other than the image itself.

Anyhow, I made is so that the user can choose any combination of methods via command-line options. I tested my system with 30 images and got about 80% accuracy. The following are a snapshot of running my program (symmetry for face finding and average-eye matching for eye finding):

Eye Detection Software Trial Run Results with Matlab

Network Intrusion Detection
My professor's lab has developed a network intrusion detection called LibAnomaly that tries to detect anomalous behavior in a log file or other system-generated record. Basically we want to know if users are not trying to break into the system by issuing dangerous or suspicious system calls. For example, if I observe that this user never touches /sbin/directory for the past 5 years and today he or she accesses the directory, I know something fishy may be going on.

Their system implements several learning models, and I added another model to it. I used a trie, or a radix tree, to store training data. Given the log record, I get a score of each query string in that log depending on how well its prefix is matched, how well its suffix is matched, how many characters are matched, and other measures. The higher the score, the less probability the string is an anomaly. This is basically a data structure thing; it doesn't really have anything to do with networking protocol or structure. Implementing a trie is not simple but doable :-)

Spatial & Temporal Queries on a Planar Graph
Spatial and temporal queries on a planar graph are a family of long-standing problems in graph and theory. Basically on a planar network, which can be likened to a road network or world-wide Internet network, we want to find, for example, the closest neighbor of a node. Our team needs to do a survey on different approaches to solve these problems and try to generalize the approaches. The papers include the following: This is actually a class assignment, but I extended it by coming up with a data structure that allows faster search time than the brute force way in the implementation of Roads, Codes, and Spatiotemporal Queries. The paper doesn't discuss how to efficiently search the codes to get the nearest neighbor of a query code. Anyway, it all comes down to given a set of binary codes of the same length, given another binary codes of the same length find the one that has the smallest hamming distance to it. For example if your structure has:

10110
00000
11111
11110
01111

And your query is '01110', then you should give '01111' as the result since it is closest to '01110' hammingly. Obviously you can do this by storing the codes in a linear structure and simply going through them linearly for the answer (commonly known as the brute force way), but a better, faster approach should exist. It is what I call a 'Hamming tree' since to my knowledge nobody has beat me to it yet. You build a tree in which each pair of nodes that shares a tree edge has a Hamming distance of 1. There is more to it than that, but I wouldn't want to stifle you with all the details. Of course you are always welcome to ask me if you have questions. The following is a table showing the performance of my Hamming tree against the brute force (Each code has 20 bits. Running time results are expressed in seconds):

Number of codes 10,000 50,000 100,000
Brute Force
k=1
k=20
k=50

.0299
.0599
.07

.1
.1899
.2599

.0599
.3199
.4099
Hamming Tree
k=1
k=20
k=50

0
.0599
.25

0
.0099
.0799

0
.0099
.0799

As you can see when the number of nodes is very high, Hamming Tree does a lot better than brute force. However, from a practical point of view the code size grows as O(square root of n), where n is the number of nodes. Therefore a code usually has a lot more than 20 bits, in which case building a Hamming tree takes too long. Anyway, it is a starting point for more serious research in this area because my hunch just tells me that a better way than brute force almost always exists :-)

ADVERTISING WITH US - I'd LOVE to advertise for you! Direct your requests to wentaihao at yahoo.com