Posts Tagged scripts

ETL for Graph Mining

In the world of data mining and analyses, there is a special set of mining methods and algorithms related to data in graphs. Graph is mathematical construct which contains vertices (nodes) and edges between them. Edge connects a pair of vertices and can be directed or undirected. As a data structure, graph is used in various extensions of this simple model. One of them is the “property graph model”, which allows vertices and edges to be uniquely identified, labeled (with one or more labels), and contain a set of key-value properties. There are a number of NoSQL graph databases which employ this data model.

Graph analyses provide measures and algorithms that help find some information which is otherwise very hard to find (i.e. using relational model or a data warehouse). Some examples: find most important people (in a social network of a kind), most important pages (on a web site or collection of sites), most important servers (in a network); decide who/what needs best protection (from flu, cyber-attacks, ..); find fake profiles and reviews on the web; cut costs (the traveling salesman problem) etc. Some of these measures are centrality measures (e.g. betweenness, closeness) or locality measures (e.g. indegree, outdegree). Also, there is a bunch of algorithms which are aimed towards identifying frequent patterns or subgraphs in a graph, or identifying interesting subgraphs or anomalies in a graph as well.

So, when I have a collection of data in my favorite relational database, how can I make use of all those graph algorithms? There are several possibilities, such as implementing the algorithms inside a relational database, but my answer to this is to perform an ETL (extract-transform-load) process in order to extract the data from the relational database and create a graph database from it. Then I can perform whatever analysis I want on this separate graph OLAP system. To do that, I defined a relational-to-graph ETL process which follows a set of 10 rules. These rules define the way the rows and their attributes (with regard to primary and foreign keys) are transformed to vertices, edges and their properties. I will not go into the details of these rules here, but if interested, please go to slides 30-40 of my IIUG2017 presentation (available here) to find explained rules with examples.

Is there any code? Of course there is :). I implemented the process as a series of stored procedures and a view, which need to be created in a relational database you want to extract the data from. After the calls:

execute procedure create_conversion_statements('~\my_working_dir');
execute procedure create_conversion_scripts('~\my_working_dir');

several scripts are created in the specified directory. One script is an SQL unload-transform script, which need to be executed in the same relational database to create a bunch of data files (one per table). Other files are Cypher scripts (Cypher is an open language used by several graph database systems, like Neo4j). These scripts can be executed in a graph database of your choice in order to create and populate it with the exported data. Once the data is in the database, you can start exploring it using Cypher or some of the graph algorithms.

All the code is open-sourced on GitHub, available here, under my ifmx-utils project.

The examples and more info can also be found in the IIUG presentation.

Formal description of the algorithm and the science around it is described in the paper Property Oriented Relational-To-Graph Database Conversion.


Hope you’ll find some of this useful.


, , , , ,

Leave a comment

Using Groovy in Daily DBA Work

Groovy is a relatively new programming language. I like to think of it as Java on steroids. It is based on Java, so all of Java code is automatically Groovy code, but not vice versa. Groovy is also compiled to Java byte-code, so it can run on every JRE. Better yet, you can use GDK (Groovy Development Kit) which provides many very easy to use APIs – working with files, XML is lot easier than in Java.

Some of its main characteristics are:

  • it is real object oriented language,
  • it is a dynamic language,
  • it can be used for scripting.

This last thing means that you can write Groovy script file and execute it from the command line without the need to compile it first, and that’s what makes it appealing. You get to use real powerful language within simple script. Of course, lots of things you’d write in Groovy could be done using shell scripts and if you’re not a programming soul, you won’t find many usefulness in it. Nevertheless, some of my next posts will include ready-to-use Groovy scripts that you may find useful, so here are some hints on what you need to do in order to run Groovy on your PC or server.

First of all, you’ll need JRE (Java Runtime Environment), which is probably already installed on every desktop PC and most of servers. As you’re probably  going to run scripts from shell or command line, ensure that you have JAVA_HOME environment variable set. It should point to main Java install directory which contains bin, lib and other directories.

After that, you need to download Groovy binaries from Groovy download site. I suggest to download zipped binaries. This file is suitable for all operating systems, as libraries are all jar files, and binaries contain both windows batch files and Unix/Linux shell scripts. All you need to do after that is unzip the file wherever you find suitable.

Final step is adjusting your PATH environment variable. It should contain both Java binaries directory and Groovy binaries directory (e.g. /usr/java/bin and /usr/groovy/bin).

You can check if all is set well by executing the command “groovy” from the command line. It should output standard usage message, something like this:

[~]> groovy
error: neither -e or filename provided
usage: groovy [options] [args]
  -a,--autosplit <splitPattern>    split lines using splitPattern (default '\s')
                                   using implicit 'split' variable
  -c,--encoding <charset>          specify the encoding of the files
  -D,--define <name=value>         define a system property
  -d,--debug                       debug mode will print out full stack traces
  -e <script>                      specify a command line script
  -h,--help                        usage information
  -i <extension>                   modify files in place; create backup if
                                   extension is given (e.g. '.bak')
  -l <port>                        listen on a port and process inbound lines
  -n                               process files line by line using implicit
                                   'line' variable
  -p                               process files line by line and print result
                                   (see also -n)
  -v,--version                     display the Groovy and JVM versions

So now we can start groovying. As a teaser, here’s a readable example: calculate bufwaits ratio based on onstat -p output (version 11.50xC9):

groovy -e "out='onstat -p'.execute().in.text.readLines(); \
pg=out[5].tokenize()[1].toLong(); \
bwr=out[5].tokenize()[6].toLong(); \
bwai=out[17].tokenize()[0].toLong(); \
println '\nBufwaits Ratio: ' + ((pg+bwr)>0 ? (bwai*100)/(pg+bwr) : 0)"

Although Groovy doesn’t require a semicolon at the end of the statement, it has to be used here to delimit the statements because this is actually one-liner. You could make a script out of it, bw_ratio.groovy, with the same contents:

out='onstat -p'.execute().in.text.readLines()
println '\nBufwaits Ratio: ' + ((pg+bwr)>0 ? (bwai*100)/(pg+bwr) : 0)

and execute it:

[~/]> groovy bw_ratio.groovy 

Bufwaits Ratio: 0.0602958416

In addition to Groovy home page with loads of examples and information, there are many sites dedicated to this powerful technology, just google for Groovy. There are some cool books about it, but I’d definitely recommend “Groovy in Action” authored by Dierk Konig.

, , ,

1 Comment