Thursday, April 11, 2013

HyperGraphDB: HyperGraph Based Database

So far all of the posts on this blog have been about graphs, be they directed, undirected, simple, non-simple, etc. That's great and graphs are my favorite, but for this post I thought that I'd shake things up a bit and talk about something similar but wacky at the same time, ya know? Any guesses? ;-D Well, if you read the title of this post it probably won't take you long to figure it out. This post will be about the less studied but more general graph-theoretic concept of hypergraphs.

Now I had heard of hypergraphs before, but until now I hadn't set aside enough time to really study them. What I picked up from my brief skimmings of the hypergraph literature was that the edges in hypergraphs could connect more than two vertices (in contrast to the graphs that we know, where each edge is associated with just two vertices). The idea sounded pretty abstract and different to me, and I had no inkling as to their applications in pure or applied mathematics. So imagine my surprise when I found out very recently that hypergraphs are being used to store database information.

WHAAAAAAAAAAAAAAAAAAAAAAAAAAATTTTTT???!!!! :-O

I know, I know. You're probably all about as surprised as I was and no doubt itching to read on about the database business. To be as thorough as possible though, and to really appreciate the ideas behind the application, let us first look at a more formal definition of hypergraphs and then see how that notion can be generalized further. My previous blurb about a hypergraph having "longer" edges will simply not do.

A graph, as I'm sure most of you know, is often defined in more formal terms as a pair of sets V (for the vertices) and E (for the edges). Hypergraphs are also defined in terms of two sets, which can likewise be denoted V and E for the vertices and edges respectively. The vertices of a hypergraph are like the vertices of regular graphs; nothing unusual there. The edges though are defined to be non-empty subsets of the vertex set. Following this definition it is possible to have hypergraph edges that consist of 3 or more distinct elements of the vertex set. If all of the edges in the hypergraph are sets of cardinality two, then we have the regular graphs that we are used to dealing with. In other words, graphs are just special cases of hypergraphs.
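To make the definition a bit more tangible, here is a minimal sketch in plain Java. The class name HypergraphSketch and the method isOrdinaryGraph are my own hypothetical names (they are not from any library), and vertices are assumed to be plain strings. The constructor enforces "every edge is a non-empty subset of V", and the method checks the special case where every edge has cardinality two.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical illustration class; not part of HyperGraphDB or any other library. */
final class HypergraphSketch {
    final Set<String> vertices;
    final List<Set<String>> edges; // kept as a list so edges can be referred to by position

    HypergraphSketch(Set<String> vertices, List<Set<String>> edges) {
        for (Set<String> edge : edges) {
            // Each edge must be a non-empty subset of the vertex set.
            if (edge.isEmpty() || !vertices.containsAll(edge)) {
                throw new IllegalArgumentException("not a non-empty subset of V: " + edge);
            }
        }
        this.vertices = Set.copyOf(vertices);
        this.edges = List.copyOf(edges);
    }

    /** True when every edge has exactly two vertices, i.e. this is a regular graph. */
    boolean isOrdinaryGraph() {
        for (Set<String> edge : edges) {
            if (edge.size() != 2) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        HypergraphSketch triangle = new HypergraphSketch(
                Set.of("a", "b", "c"),
                List.of(Set.of("a", "b"), Set.of("b", "c"), Set.of("a", "c")));
        HypergraphSketch blob = new HypergraphSketch(
                Set.of("a", "b", "c"),
                List.of(Set.of("a", "b", "c")));
        System.out.println(triangle.isOrdinaryGraph()); // true
        System.out.println(blob.isOrdinaryGraph());     // false
    }
}
```

The triangle is a regular graph written in hypergraph form, while the second example has a single edge covering all three vertices, which no regular graph edge can do.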

A natural thought that may creep into your mind is, how are hypergraphs visualized? From what I have seen I don't think that that area of graph visualization has been studied very much (at least relative to the visualization of regular graphs). So far I have only seen two ways of visualizing hypergraphs. The first way simply draws a line or tree connecting the vertices of an edge and assigns an edge labeling to make it clear that the edge covers 3 or more vertices rather than two. The other way that I have seen hypergraphs drawn is by visualizing the vertices of an edge all within some colored blob (being real technical here). Seeing two edges intersect within such a drawing is similar to seeing two sets of a Venn Diagram intersect. 

With the basic idea of a hypergraph in mind let us now see how things like paths and cycles can be generalized in the hypergraph sense. A walk of length k in a hypergraph is defined to be a sequence of k vertices v1, v2, ..., vk where each consecutive pair vi and v(i+1) is contained in a common edge. A path in a hypergraph is then defined to be such a walk in which all of the vertices and edges are distinct. In a closed walk the beginning and ending vertices are the same, and likewise a closed path is a path where the starting and ending vertices match (these being the one repetition allowed). If any of these definitions sound odd to you just remind yourself about the simplest case in which the hypergraph is a regular graph. Looking at things that way helps to give a more set theoretic view of the graphs that we are used to dealing with.
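The walk definition translates almost word for word into code. Below is a small sketch in plain Java; HyperWalks, isWalk, shareAnEdge and isClosed are hypothetical names of my own, and edges are assumed to be sets of string vertices. It only checks the walk and closed-walk conditions. The extra "all vertices and edges distinct" requirement for paths is left out because it depends on which edge you pick for each step.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper class for illustration only. */
final class HyperWalks {

    /** True if some edge contains both u and v. */
    static boolean shareAnEdge(List<Set<String>> edges, String u, String v) {
        for (Set<String> edge : edges) {
            if (edge.contains(u) && edge.contains(v)) {
                return true;
            }
        }
        return false;
    }

    /** True if every consecutive pair of vertices in the sequence lies in a common edge. */
    static boolean isWalk(List<Set<String>> edges, List<String> sequence) {
        if (sequence.isEmpty()) {
            return false; // assumption: an empty sequence is not treated as a walk
        }
        for (int i = 0; i + 1 < sequence.size(); i++) {
            if (!shareAnEdge(edges, sequence.get(i), sequence.get(i + 1))) {
                return false;
            }
        }
        return true;
    }

    /** True if the sequence starts and ends at the same vertex. */
    static boolean isClosed(List<String> sequence) {
        return sequence.size() > 1
                && sequence.get(0).equals(sequence.get(sequence.size() - 1));
    }

    public static void main(String[] args) {
        List<Set<String>> edges = List.of(Set.of("a", "b", "c"), Set.of("c", "d"));
        System.out.println(isWalk(edges, List.of("a", "c", "d"))); // true
        System.out.println(isWalk(edges, List.of("a", "d")));      // false: no edge holds both
    }
}
```

Notice that "a" to "c" is a single step here even though a regular graph would need an explicit a-c edge for it; the 3-vertex edge covers every pair inside it.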

The last (or on second thought second to last) generalized graph concept that I thought I would touch on before getting to the fun database application is the hypergraph version of subgraphs. Remember that subgraphs are basically graphs contained within graphs. They consist of pairs of sets V' and E' where V' and E' are subsets of the original graph's vertex and edge sets V and E. The definition of a sub-hypergraph is a little bit more involved. Let J = {1,2,...,k} be an indexing set for a subset of hypergraph edges E1, E2, ..., Ek. Also, let V' be a subset of the hypergraph's vertices. The sub-hypergraph induced by V' and J is defined to be the set of intersections between each Ej and V', where j ranges from 1 to k and only the nonempty intersections are kept. Again it's good to step back from this more abstract notion of subgraphs and think about how it relates to the simpler graph case. In that scenario imagine that the indexing set J goes from 1 to m for all of the m edges in the graph. The above definition then boils down to finding the resulting graph in which the only edges considered are the ones that have endpoints in the vertex subset V'.
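Since the induced sub-hypergraph is just "intersect each chosen edge with V' and throw away the empty results", it can be sketched in a few lines of plain Java. SubHypergraphs and induced are hypothetical names of my own, vertices are assumed to be strings, and J holds 0-based list positions here instead of the 1-based indices used in the text.

```java
import java.util.LinkedHashSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper class for illustration only. */
final class SubHypergraphs {

    /**
     * Returns the nonempty intersections of V' with each edge whose
     * (0-based) position is in J.
     */
    static Set<Set<String>> induced(List<Set<String>> edges,
                                    Set<Integer> j,
                                    Set<String> vPrime) {
        Set<Set<String>> result = new LinkedHashSet<>();
        for (int index = 0; index < edges.size(); index++) {
            if (!j.contains(index)) {
                continue; // this edge was not selected by the indexing set
            }
            Set<String> intersection = new HashSet<>(edges.get(index));
            intersection.retainAll(vPrime);
            if (!intersection.isEmpty()) {
                result.add(intersection);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> edges = List.of(
                Set.of("a", "b", "c"), Set.of("c", "d"), Set.of("e", "f"));
        // Select all three edges and keep only the vertices a, b and d.
        System.out.println(induced(edges, Set.of(0, 1, 2), Set.of("a", "b", "d")));
    }
}
```

In this example the first edge shrinks to {a, b}, the second shrinks to {d}, and the third disappears because it shares nothing with V'.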

The final note to make before moving on to applications, and one that is really important for this post, is the following. Just as hypergraphs generalize the idea of graphs, hypergraphs can also be generalized a bit. How? Well, rather than just pointing to nodes (and again it's assumed that hyperedges can correspond to 3 or more nodes), hyperedges in a more generalized view can "point" to other edges as well. This is the approach that is taken by the database application that we will now touch on.
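One way to picture "edges that point to edges" is to put nodes and links under a single type, so that a link's targets can be either. The sketch below does that in plain Java. AtomSketch, node, link and arity are hypothetical names I made up for illustration; this is only a toy model of the idea and is not HyperGraphDB's actual API.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical toy model: a node is an atom with no targets, a link is an atom with targets. */
final class AtomSketch {
    final Object value;
    final List<AtomSketch> targets;

    private AtomSketch(Object value, List<AtomSketch> targets) {
        this.value = value;
        this.targets = targets;
    }

    static AtomSketch node(Object value) {
        return new AtomSketch(value, List.of());
    }

    /** A link may target nodes, other links, or a mix of both. */
    static AtomSketch link(Object value, AtomSketch... targets) {
        return new AtomSketch(value, List.copyOf(Arrays.asList(targets)));
    }

    boolean isLink() {
        return !targets.isEmpty();
    }

    int arity() {
        return targets.size();
    }

    public static void main(String[] args) {
        AtomSketch a = node("a");
        AtomSketch b = node("b");
        AtomSketch c = node("c");
        AtomSketch abc = link("three-way edge", a, b, c);
        // This link points at another link as well as at a plain node.
        AtomSketch meta = link("edge about an edge", abc, a);
        System.out.println(meta.arity());                 // 2
        System.out.println(meta.targets.get(0).isLink()); // true
    }
}
```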

This database application is called HyperGraphDB. It is an embedded database that uses a hypergraph as its underlying storage mechanism. The nodes and edges (or links as HyperGraphDB calls them) are all categorized as "atoms" of the database (atoms are basically just database entities/objects). The links can of course point to any number of atoms, consistent with the general definition of a hypergraph. The particular example that I have shown here (the line of code below) makes use of HyperGraphDB's HGValueLink class. There are several link classes defined within HyperGraphDB, HGValueLink being the one you use when you want to associate some sort of value (which can be any kind of object) with your hyperedge. In this example I represent my name with a String and pass it as the first argument to the HGValueLink constructor. What comes after that is an array named nodes. Not surprisingly, this array holds information relating to the nodes (and/or links) that are to be contained within the link variable hyperedge.

HGValueLink hyperedge = new HGValueLink("Tanya", nodes);

Now some of you may be wondering, what exactly is the nodes array? What kind of information does it hold? It turns out that the array does not hold regular object references to the nodes being "connected", but rather their associated "HGHandle"s. HGHandle is another type (an interface) defined within the core API. Handles serve as identifiers for the objects added to the database. For the most part (there are exceptions to this) you will need to refer to an object's handle if you want to do something with it. This is definitely the case when your desire is to remove a node or link from the database. There are times when you can still refer to an atom through its Java object reference, but from what I've seen so far HyperGraphDB is different in that it emphasizes the use of these handles rather than the traditional approach of using object references to get to things.

The list of HyperGraphDB's capabilities goes on and on, but I will stop here as the main point of this post was to name a hypergraph database application and show how it conforms to the hypergraph definitions explained earlier. One thing you may still be wondering is, why would someone ever want to use a hypergraph database? What are they good at modeling? The answer depends on what type of data you are looking at and the information that you aim to get out of your data. One of the simplest examples that I could think of is the following. Suppose that you wish to model some of your Google+ hangouts data as a graph. You and all of the people that you have hung out with are the nodes of a graph, and an edge exists between two nodes if the corresponding people hung out. Even if you and two of your hangout buddies all form a cycle, there is no way of telling from this very basic graph model whether, say, the 3 of you were ever in a hangout all at the same time. To do that you may have to add information to the nodes and/or edges, partition your vertices into two groups where one of the groups represents people and the other represents hangouts on a given day or time, or, as this post would suggest, model your hangouts in hypergraph style. All of the people involved in the same hangout will be contained within the same hyperedge. This underscores the extreme flexibility of hypergraphs in one sense, but also shows how the best solution can really depend on the problem at hand and what you hope to accomplish.
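The hangout example can be checked with a little plain Java. In the sketch below, HangoutModel, pairwiseEdges and sharedHangout are hypothetical names of my own, and the people ("me", "ann", "bo") are made up. Each hangout is a hyperedge, i.e. a set of people. Flattening the hangouts into regular graph edges loses exactly the information discussed above: one three-person hangout and three separate two-person hangouts produce the same triangle.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration class; the names and data are invented for this example. */
final class HangoutModel {

    /** Flattens hangouts (hyperedges) into regular graph edges between pairs of people. */
    static Set<Set<String>> pairwiseEdges(List<Set<String>> hangouts) {
        Set<Set<String>> pairs = new HashSet<>();
        for (Set<String> hangout : hangouts) {
            List<String> people = new ArrayList<>(hangout);
            for (int i = 0; i < people.size(); i++) {
                for (int k = i + 1; k < people.size(); k++) {
                    pairs.add(Set.of(people.get(i), people.get(k)));
                }
            }
        }
        return pairs;
    }

    /** True if some single hangout contained everyone in the group. */
    static boolean sharedHangout(List<Set<String>> hangouts, Set<String> group) {
        for (Set<String> hangout : hangouts) {
            if (hangout.containsAll(group)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> oneGroupHangout = List.of(Set.of("me", "ann", "bo"));
        List<Set<String>> threePairHangouts = List.of(
                Set.of("me", "ann"), Set.of("me", "bo"), Set.of("ann", "bo"));
        Set<String> allThree = Set.of("me", "ann", "bo");

        // The flattened graphs are identical...
        System.out.println(pairwiseEdges(oneGroupHangout)
                .equals(pairwiseEdges(threePairHangouts)));                 // true
        // ...but the hyperedges still tell the two situations apart.
        System.out.println(sharedHangout(oneGroupHangout, allThree));   // true
        System.out.println(sharedHangout(threePairHangouts, allThree)); // false
    }
}
```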

If anyone has worked with HyperGraphDB or just hypergraphs as models in general please leave a comment. I am eager to learn more about hypergraphs! :-D