Posts Tagged migration

ETL for Graph Mining

In the world of data mining and analyses, there is a special set of mining methods and algorithms related to data in graphs. Graph is mathematical construct which contains vertices (nodes) and edges between them. Edge connects a pair of vertices and can be directed or undirected. As a data structure, graph is used in various extensions of this simple model. One of them is the “property graph model”, which allows vertices and edges to be uniquely identified, labeled (with one or more labels), and contain a set of key-value properties. There are a number of NoSQL graph databases which employ this data model.

Graph analyses provide measures and algorithms that help find some information which is otherwise very hard to find (i.e. using relational model or a data warehouse). Some examples: find most important people (in a social network of a kind), most important pages (on a web site or collection of sites), most important servers (in a network); decide who/what needs best protection (from flu, cyber-attacks, ..); find fake profiles and reviews on the web; cut costs (the traveling salesman problem) etc. Some of these measures are centrality measures (e.g. betweenness, closeness) or locality measures (e.g. indegree, outdegree). Also, there is a bunch of algorithms which are aimed towards identifying frequent patterns or subgraphs in a graph, or identifying interesting subgraphs or anomalies in a graph as well.

So, when I have a collection of data in my favorite relational database, how can I make use of all those graph algorithms? There are several possibilities, such as implementing the algorithms inside a relational database, but my answer to this is to perform an ETL (extract-transform-load) process in order to extract the data from the relational database and create a graph database from it. Then I can perform whatever analysis I want on this separate graph OLAP system. To do that, I defined a relational-to-graph ETL process which follows a set of 10 rules. These rules define the way the rows and their attributes (with regard to primary and foreign keys) are transformed to vertices, edges and their properties. I will not go into the details of these rules here, but if interested, please go to slides 30-40 of my IIUG2017 presentation (available here) to find explained rules with examples.

Is there any code? Of course there is :). I implemented the process as a series of stored procedures and a view, which need to be created in a relational database you want to extract the data from. After the calls:

execute procedure create_conversion_statements('~\my_working_dir');
execute procedure create_conversion_scripts('~\my_working_dir');

several scripts are created in the specified directory. One script is an SQL unload-transform script, which need to be executed in the same relational database to create a bunch of data files (one per table). Other files are Cypher scripts (Cypher is an open language used by several graph database systems, like Neo4j). These scripts can be executed in a graph database of your choice in order to create and populate it with the exported data. Once the data is in the database, you can start exploring it using Cypher or some of the graph algorithms.

All the code is open-sourced on GitHub, available here, under my ifmx-utils project.

The examples and more info can also be found in the IIUG presentation.

Formal description of the algorithm and the science around it is described in the paper Property Oriented Relational-To-Graph Database Conversion.


Hope you’ll find some of this useful.


, , , , ,

Leave a comment

Migrating your onconfig

The one really inconvenient thing about migrating an Informix instance is what the upgrade does to the onconfig file. Well, basically, it only adds the new config parameters to the end of file. So, what you lose is nice order and grouping of parameters with comments describing them. After doing a major upgrade on some instance, your onconfig would be a mess – fewer parameters would be documented in the file itself, and there’d be lots of new parameters at the end of file.

If you don’t want to live with this, you could take new version’s onconfig.std, copy it, run a diff to your current onconfig and manually insert your values in the copy. I like my production onconfigs neat, but as the onconfig grows, maintaining it becomes more time consuming.

So here is the much quicker solution. A groovy script that’ll do that for you. In one of my previous posts, I’ve described how to run groovy scripts. What the script does is take the onconfig.std, make a copy and put your values to the right places in that copy. Parameters that are commented out or don’t show in the std file are added to the end. Comments at the end of line containing actual parameter value are preserved. New onconfig file is named like the original one, with the “.new” at the end.

Here is a usage example. Prepare the current onconfig file for migration to version 11.70FC4:

groovy migrateOnconfig.groovy \
    -s /opt/IBM/informix1170FC4/etc/onconfig.std

The output would be the $INFORMIXDIR/etc/$ file.

And here is the script itself. Just copy the code in a file called migrateOnconfig.groovy, and feel free to try it. There’s nothing to lose, as there is no impact to existing files.

 * script to help migrating onconfig to whichever version
 * provide actual onconfig file and onconfig.std of version
 * your migrating on
 * parameters that have no default value in std are appended to the end of file
 * @author Ognjen Orel

// command line parsing specification
def cl = new CliBuilder(usage: getClass().getName() + ' options')
cl.h(longOpt: 'help', 'Show usage information and quit')
cl.i(argName: 'inputFile', longOpt: 'inputFile', args: 1, required: true, 
     'onconfig file currently in use, REQUIRED')
cl.s(argName: 'stdFile', longOpt: 'stdFile', args: 1, required: true, 
     'onconfig.std file to use as a template, REQUIRED')

def options = cl.parse(args)

File std = new File(options.s), output = new File(options.i + '.new')
List input = new File(options.i).readLines()

String param, additional
List written = new ArrayList()
def inputLines
def needsAdditional = ['VPCLASS', 'BUFFERPOOL']


std.eachLine { stdLine ->
   // copy all comment or empty lines
   if (stdLine.startsWith('#') || stdLine.trim().isEmpty())
      output << stdLine + '\n'
   else {
      param = stdLine.tokenize()[0]

      if (needsAdditional.contains(param))
         additional = stdLine.tokenize(',')[0].tokenize(' ')[1]
         additional = null

      inputLines = input.findAll{ 
          it.matches('(' + param + ')(\\s+)(.*)') || it.equals(param) }
      if (!inputLines.isEmpty()) {
         if (additional != null)
            inputLines = inputLines.findAll { it.contains(additional) }

         inputLines.each {
            output << it + '\n'
            written.add it
         output << stdLine + '\n'
// write all parameters with no default value in onconfig.std at the end
output << '\n\n### parameters with no default value in onconfig.std: \n\n'
(input - written).each {
   if (!it.trim().startsWith('#') && !it.trim().isEmpty())
      output << it + '\n'

, ,

Leave a comment