
During my M.S. I conducted several small to
medium-sized research projects either
alone or on a team. Most of them are projects assigned in classes but attention
to detail and a passion for exploration being my second nature, I almost always
extend them and make them work
better. A research project is different from a regular programming
project in several aspects:
1.
A research project doesn't tell you exactly how to do something. It only
tells you the goal. It is you or your team who should figure out how to approach
it.
2.
A research project should be driven toward solving a problem that hasn't been
solved completely. It shouldn't be solving problems that have already been
thoroughly explored and concretely solved.
The following projects, as you can see, include uncertainty elements that are
left with me or my team to solve. Most of the time, if not all of the time, the
problem is not completely solved and our solution leaves a HUGE room for improvement.
Therefore they are considered 'Research Projects' instead of
'Programming Projects' as those listed in
My
Works.
In a distributed system course I took we decided to do a
project with Google API. Our vision to create a document search engine so that
the user can find a document on the Internet that has a high degree of
similarity to the given document. Possible applications include checking for
plagiarism in the academia, learning more about a big topic, and checking to see
whether an idea is already taken. We use JICOS to do distributed and parallel
computing so that computation of results of a query can be returned in the
shortest amount of time possible.
One thing worth noting is that Google's rank is based on more than 100
factors, many of which are related to the popularity of the web page. Therefore
the highest ranked web page does not even have to contain similarity the query
document. The goal of a document search engine is that if there exists such a
similar document on the Web, we should find it and it should be ranked at the
very top. Therefore we came up with our own ranking mechanism that assigns each
returned URL a score based on purely the similarity and relevancy. The user can
choose to display the results according to Google rank, Our rank, or
alphabetical order.
Also Google API imposes several constraints such as the number of queries one
can make and so on. Since our underlying search system relies on Google API
completely, we have no control over that.
Our system is written in Java 100%. We use JSP technology so that our system
can be made accessible to anyone who has an Internet-enabled browser. Here are a
couple of snapshots of GoogleDoc's user interface and results' display:
As you can see Ice and Fire coexist in 'GoogleDoc', meaning that it
is able to perform well under situations of both extremes.
In this project we focus on answering the question: Is the
WS-Security Framework adequate to handle every type of real-world scenario that
requires rigorous security? The scenario we are looking at is a loan bidding
scenario. The client logs on a loan bidding website and seeks bids for a loan of
certain size and certain period. The website sends the request out to the banks
it trusts and begins the negotiation process. Each bank has various departments
internally and each of them has certain responsibilities. In the event where a
loan cannot be fulfilled, a bank seeks bids from its 'partners' which
are independent entities who are not trusted by the loan bidding website. Of
course the details are much more complicated. This is just the general idea.
As you can see there are many security requirements that must be fulfilled by
the WS-Security Framework in order for this scenario to migrate to web service.
WS-Security is the foundation block that takes care of message-level security.
Many extensions were drafted to handle different issues. They include WS-Trust,
WS-Policy, WS-SecureConversation, WS-Federation, and so on. We focused on
WS-Security, WS-Trust, WS-Policy, and WS-SecureConversation to see if they could
fulfill the security requirements of our scenario. Here is what WS-Framework
looks like (from BEA Systems):
The conclusion is that the WS-Security Framework is
adequate in most situations but there are things its specifications do not
address and there are things we don't like about them. This project quickly
evolves into a 30-page report detailing what we've found :-)
Although this project is not implementation-related, there are tons of
specifications we need to read and UNDERSTAND. It's inevitably difficult to
understand them because they are drafted by dozens of experienced engineers from
industrial leaders like Microsoft, IBM, BEA, and Verisign. Despite all that we
managed to find several loopholes and potential scenarios that allow people to
exploit those loopholes. Not bad for 3 weeks of work by only 3 people, huh?
Berkeley
DB is a well-known database management service provider. It is developed by
Sleepycat Software and it has been quite successful. It contains implementations
in a wide variety of platforms, including C, C++, and Java. In a class I took we
were assigned the task of writing a GTM, or global transaction manager, so that
the system can do distributed database management in the presence of failures.
Every distributed database is associated with an LTM, or local transaction
manager, to manage the local database. Anyone working in the database industry
knows that getting a database management system (DBMS) to work correctly
without
having to consider failures is extremely simple. Taking into account the
possibility of failures, however, makes the system ultra-extremely (was trying
to think of a word even more intense than extremely...) difficult to build. The
point is that every distributed database must agree upon an action for an
unresolved transaction; it either aborts or commits. A GTM or LTM could fail at
any point in time (could get down to instruction-level failure).
So in building
the system we need to consider every possible failure, or as many of them as
possible. Very often switching two lines of code yields totally different
results. With many years of experience in programming and attention to fine
detail, we were able to build a successful system. We also wrote sample
applications that could communicate interactively with the LTMs, which was used
to test the failure-handling aspect of our system.
The following is a diagram describing the model of our system and the message
types:
In this research project we are supposed to solve the dynamic
version of the famous Dictionary Problem. The dictionary problem is explained in
detail by the paper 'Efficient Solutions to the Replicated Log and
Dictionary Problems' by Gene T.J. Wuu and Arthur J. Bernstein. In essence
you have many distributed systems sitting in different geographical areas and
they are communicating through a network. Each system is capable of performing
operations on an imaginary 'centralized' database, and the only
requirement is that each system must hold consistent global view relative to one
another.
Putting it in a simpler way, a system can add or delete a word from its
'dictionary' which is stored in its local storage. Once I do something
I need to inform everybody else so that they can do the same. Each system is
multi-threaded to enable maximum efficiency and concurrency. However the systems
need to have concurrency control so that no system would have more than one
thread enter its critical section simultaneously. Basically our job is to
implement a 'quorum' algorithm, many of which have been proposed by
many papers. We need to take care of deadlocks, a new node joining, an existing
node leaving, and mutual exclusion. After extensive searching and evaluating, we
found that Maekawa's
A N^(1/2) Algorithm for Mutual Exclusion in
Decentralized Systems is effective and relatively easy to implement. After
spending some time we correctly implemented the whole system. A node can join or
leave anytime it feels like without disrupting the updating of information at
each node. How about a big hand for the brains :-)
It's self-explanatory. The program is given an image and is supposed to tell
you where eyes of people are located. As you can see finding eyes reliably is
extremely easy for people to do but difficult for machines to do. I used
Matlab because it has a very useful image toolbox that I can use to my
advantage. Anyway my approach is broken down into two stages: find the face, and
then find the eyes. To find the face, I designed two methods: one utilizing the
high symmetry of a human face and the other using average face matching.
In the
first method I scan the image for the region that has the best symmetry, and
then perform some optimization to get better results. Also I apply a skin filter
(2 types the user can choose from) to the face detection window to further
shrink down the search space.
In the second method I
scan the image for the region that has the best correlation with the given
average face template. To find eyes I also implemented two ways: rule-based one
and average eye-matching. The rule-based method simply means the program tries
to find two clusters of dark pixels (eyes are dark...) with some constraints
applied to them. It takes too long and is highly
error-prone. The average eye-matching, however, is extremely fast and gives
accurate results, but user must supply the approximate ratio of the face size to
the image size. This requirement renders this method less practical because
ideally, user shouldn't tell the program anything other than the image itself.
Anyhow, I made is so that the user can choose any combination of methods via
command-line options. I tested my system with 30 images and got about 80% accuracy. The
following are a snapshot of running my program (symmetry for face finding and
average-eye matching for eye finding):
My professor's lab has developed a network intrusion detection called LibAnomaly
that tries to detect anomalous behavior in a log file or other system-generated
record. Basically we want to know if users are not trying to break into the
system by issuing dangerous or suspicious system calls. For example, if I
observe that this user never touches /sbin/directory for the past 5 years and
today he or she accesses the directory, I know something fishy may be going on.
Their system implements several learning models, and I added another model to
it. I used a trie, or a radix tree, to store training data. Given the log
record, I get a score of each query string in that log depending on how well its
prefix is matched, how well its suffix is matched, how many characters are
matched, and other measures. The higher the score, the less probability the string
is an anomaly. This is basically a data structure thing; it doesn't really have
anything to do with networking protocol or structure. Implementing a trie is not simple but
doable :-)
Spatial and temporal queries on a planar graph are a family of
long-standing problems in graph and theory. Basically on a planar network, which
can be likened to a road network or world-wide Internet network, we want to
find, for example, the closest neighbor of a node. Our team needs to do a survey
on different approaches to solve these problems and try to generalize the
approaches. The papers include the following:
- [SKS02]
C.Shahabi and M.R.K.M. Sharifzadeh. A road network embedding technique for
k-nearest neighbor search in moving object databases. In ACM-GIS 2002,
McLean, VA.
- [PZMT03]
D.Papadias, J.Zhang, N.Mamoulis, and Y.Tao. Query Processing in Spatial
Network Databases. In VLDB 2003, Berlin, Germany.
- [JKPT03]
C. S. Jensen and J. Kolarvr and T. B. Pedersen and I. Timko. Nearest Neighbor
Queries in Road Networks. In ACMGIS 2003, New Orleans, Louisiana, USA.
- [KS04]
M. Kolahdouzan and C. Shahabi. Voronoi-based K Nearest Neighbor Search for
Spatial Network Databases. In VLDB 2004, Toronto, Canada.
- [OBSC00]
A. Okabe and B. Boots and K. Sugihara and S. N. Chiu. Spatial Tessellations,
Concepts and Applications of Voronoi Diagrams. John Wiley and Sons Ltd., 2nd
edition, 2000.
- [GKR04]
S. Gupta and S. Kopparty and C. Ravishankar. 'Roads, Codes, and
Spatiotemporal Queries', In PODS 2004, Paris, France.
This is actually a class assignment, but I extended it by coming up
with a data structure that allows faster search time than the brute force way in
the implementation of
Roads, Codes, and Spatiotemporal Queries. The
paper doesn't discuss how to efficiently search the codes to get the nearest
neighbor of a query code. Anyway, it all comes down to given a set of binary codes of the same length, given
another binary codes of the same length find the one that has the smallest
hamming distance to it. For example if your structure has:
10110
00000
11111
11110
01111
And your query is '01110', then you should give '01111' as
the result since it is closest to '01110' hammingly. Obviously you can
do this by storing the codes in a linear structure and simply going through them
linearly for the answer (commonly known as the brute force way), but a better,
faster approach should exist. It is what I call a 'Hamming tree' since
to my knowledge nobody has beat me to it yet. You build a tree in which each
pair of nodes that shares a tree edge has a Hamming distance of 1. There is more
to it than that, but I wouldn't want to stifle you with all the details. Of
course you are always welcome to ask me
if you have questions. The following is a table showing the performance of my
Hamming tree against the brute force (Each code has 20 bits. Running time results are expressed in seconds):
| Number of codes
|
10,000
|
50,000
|
100,000
|
Brute Force
k=1
k=20
k=50 |
.0299
.0599
.07 |
.1
.1899
.2599 |
.0599
.3199
.4099 |
Hamming Tree
k=1
k=20
k=50 |
0
.0599
.25 |
0
.0099
.0799 |
0
.0099
.0799 |
As you can see when the number of nodes is very high, Hamming Tree does a lot
better than brute force. However, from a practical point of view the code size
grows as O(square root of n), where n is the number of nodes. Therefore a code
usually has a lot more than 20 bits, in which case building a Hamming tree takes
too long. Anyway, it is a starting point for more serious research in this area
because my hunch just tells me that a better way than brute force almost always
exists :-)