Posts Tagged graph database

ETL for Graph Mining

In the world of data mining and analyses, there is a special set of mining methods and algorithms related to data in graphs. Graph is mathematical construct which contains vertices (nodes) and edges between them. Edge connects a pair of vertices and can be directed or undirected. As a data structure, graph is used in various extensions of this simple model. One of them is the “property graph model”, which allows vertices and edges to be uniquely identified, labeled (with one or more labels), and contain a set of key-value properties. There are a number of NoSQL graph databases which employ this data model.

Graph analyses provide measures and algorithms that help find some information which is otherwise very hard to find (i.e. using relational model or a data warehouse). Some examples: find most important people (in a social network of a kind), most important pages (on a web site or collection of sites), most important servers (in a network); decide who/what needs best protection (from flu, cyber-attacks, ..); find fake profiles and reviews on the web; cut costs (the traveling salesman problem) etc. Some of these measures are centrality measures (e.g. betweenness, closeness) or locality measures (e.g. indegree, outdegree). Also, there is a bunch of algorithms which are aimed towards identifying frequent patterns or subgraphs in a graph, or identifying interesting subgraphs or anomalies in a graph as well.

So, when I have a collection of data in my favorite relational database, how can I make use of all those graph algorithms? There are several possibilities, such as implementing the algorithms inside a relational database, but my answer to this is to perform an ETL (extract-transform-load) process in order to extract the data from the relational database and create a graph database from it. Then I can perform whatever analysis I want on this separate graph OLAP system. To do that, I defined a relational-to-graph ETL process which follows a set of 10 rules. These rules define the way the rows and their attributes (with regard to primary and foreign keys) are transformed to vertices, edges and their properties. I will not go into the details of these rules here, but if interested, please go to slides 30-40 of my IIUG2017 presentation (available here) to find explained rules with examples.

Is there any code? Of course there is :). I implemented the process as a series of stored procedures and a view, which need to be created in a relational database you want to extract the data from. After the calls:

execute procedure create_conversion_statements('~\my_working_dir');
execute procedure create_conversion_scripts('~\my_working_dir');

several scripts are created in the specified directory. One script is an SQL unload-transform script, which need to be executed in the same relational database to create a bunch of data files (one per table). Other files are Cypher scripts (Cypher is an open language used by several graph database systems, like Neo4j). These scripts can be executed in a graph database of your choice in order to create and populate it with the exported data. Once the data is in the database, you can start exploring it using Cypher or some of the graph algorithms.

All the code is open-sourced on GitHub, available here, under my ifmx-utils project.

The examples and more info can also be found in the IIUG presentation.

Formal description of the algorithm and the science around it is described in the paper Property Oriented Relational-To-Graph Database Conversion.


Hope you’ll find some of this useful.


, , , , ,

Leave a comment

Informix goes NoSQL?

New Informix version – 12.10.xC2 – is available for several days now, and we can finally speak about the most important new feature in this release, officially called JSON compatibility. What lies beneath this is actually support for MongoDB functionality. MongoDB is very popular document-type NoSQL database and it’s no surprise the IBM choose its capabilities in order to bring the first of NoSQL features into Informix. Some of the news include:

  • native support for JSON and BSON documents
  • wire listener, a gateway server that enables the communication between MongoDB applications and Informix database through standard MongoDB drivers
  • data sharding- horizontal partitioning
  • MongoDB shell utilities available to use with Informix database

Even though some of the main MongoDB/NoSQL features are covered, there are still some of the MongoDB database commands and collection commands not supported.

Seems pretty good for first of NoSQL features. So why the question mark in the subject? To answer that, I have to write few words about CAP theorem. It is a proven theorem defined by Eric Brewer in 2000, and not so known among (relational) database people. In the simplest words, the CAP theorem states that, out of three distributed systems requirements (Consistency, Availability, Partition tolerance), you can only have two. Here, consistency means the same as A in ACID – commits are atomic throughout the system. Availability means that system is available at all times, while partition tolerance means the system will function correctly even if partitioned, and only a total network failure will prevent that. Since only two of these requirements could be meet at a time, these are often represented by a triangle, and the system could lie on one of triangle sides. The relational database systems occupy the consistency-availability side. Most NoSQL systems arose from the need for as much uptime and responsiveness as possible so NoSQL systems are often aligned on the sides near the vertex P. Here is a nice picture representing the CAP requirements and some of the databases (from Ofirm):


As we move in NoSQL direction, or if you’d like, to MongoDB features, that could imply a move toward different policies. Common logic states the database system should behave the same way as before, especially since the integration of JSON management goes deep in some parts of Informix, such as SPL or transactions. There is not much info available on how and if the partition tolerance in NoSQL-sense is actually implemented in Informix, has the goal somehow been shifted from A to P, hence placing at least some of Informix JSON capabilities on the same side of the triangle as the MongoDB? If yes, how does that affect other network critical operations, such as replication? If no, would the current network and replication policies be enough to satisfy a common MongoDB user’s partition tolerance needs?

Well, either way, I believe this new feature is another breakthrough and will find its usage in time to come. To emphasize that a bit: one of the most important features of the NoSQL systems is the ability to adapt fast to new customer needs, with instant schema changes. The truth is, no information system put in production is ever finished, but always needs to follow the changing world around it – legislative, expanding business, users demands. Schema migration has always been a lengthy and demanding process for a relational databases, but is a natural thing in NoSQL environment. On the other hand, enterprises may have needs for NoSQL system to cover the mission critical part of their businesses, but also the need for a stable transactional database to handle other data, where availability is much more important than partition tolerance. So this expansion of Informix may be an opportunity to mix the best of two worlds. Surely someone will find the great usage for these new features, and I’m looking forward to hear and read the success stories.

When we’re talking about NoSQL world, it should be mentioned there are four types of NoSQL databases: document (implemented now in Informix), key-value, big table, and graph databases (some of the features implemented in DB2). Truth be told, all of these should somehow be covered in Informix as well. I’m especially interested to see graph database capabilities implemented. Graph databases often use a property graph data model (for example, Neo4J), which is one of the most natural ways to describe various domains, and it would be very interesting to be able to play with both graph and relational data model in the same database. But I’ll write more on graphs in some other post…

, , , ,

1 Comment