Posts Tagged open-source

Visual aggregates

There are a number of built-in aggregates in Informix which have been there for what seems like forever and haven't changed through many versions now. If you're using SQL, you surely use aggregates – COUNT, SUM, AVG, to name just a few. And if you want something more, there is a way to create your own aggregate. Only four functions need to be implemented, which can be done in SPL, Java or C. These functions are:

  • init – initialises the data structure needed to calculate the aggregate and states the data types the aggregate can be used on; called once at the beginning of the aggregate execution
  • iter – called once for each row in the dataset; adds the information brought by that row to the current result value of the aggregate and passes it on to the next iteration
  • combine – called if the aggregate is running in parallel, to combine the current results of two threads
  • final – called at the end of the aggregate calculation; returns the final result of the aggregate.

After that, one simple SQL statement creates the aggregate. For instance, if we want to create a median aggregate, we first implement the median_init, median_iter, median_combine and median_final functions and execute:

CREATE AGGREGATE median WITH
   (INIT = median_init, 
    ITER = median_iter,
    COMBINE = median_combine,
    FINAL = median_final);

and it is ready to be used in SQL:

SELECT employee_id, median (salary) FROM income GROUP BY 1;
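
The real median implementation lives in the GitHub repository linked at the end of this post; just to illustrate the idea, here is a rough sketch (not the repository code) of what median_final could look like, assuming median_init, median_iter and median_combine simply collect the values into a LIST, exactly like the histogram functions shown later in this post:

-- a sketch only: assumes the other three functions gather all the
-- values into a LIST(DECIMAL NOT NULL), as the histogram example does
CREATE FUNCTION median_final (final LIST(DECIMAL NOT NULL))
RETURNING DECIMAL;

  DEFINE n, i, lo, hi INT;
  DEFINE v, acc DECIMAL;

  SELECT COUNT(*) INTO n FROM TABLE(final);
  IF n = 0 THEN
    LET v = NULL;
    RETURN v;
  END IF

  -- 1-based positions of the one or two middle elements
  LET lo = TRUNC((n + 1) / 2);
  LET hi = TRUNC(n / 2) + 1;

  LET i = 0;
  LET acc = 0;

  FOREACH
    SELECT * INTO v FROM TABLE(final) ORDER BY 1

    LET i = i + 1;
    IF i = lo OR i = hi THEN
      LET acc = acc + v;
    END IF
    IF i = hi THEN
      EXIT FOREACH;
    END IF
  END FOREACH

  IF lo = hi THEN
    RETURN acc;
  END IF
  RETURN acc / 2;
END FUNCTION;

Everything stays in SPL here, which is convenient, but it keeps all values of a group in memory; the C or Java route mentioned above would scale better for very large groups.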

There is a nice post from Andrew Ford on how to create aggregates; take a look at it: User Defined Aggregates.

The limitation of an aggregate in general, from a data research point of view, is that it returns a single value, which describes the selected dataset in a particular way. If we want to find out something about the dataset, what does it mean to get only one number that sums it up – be it a count, an average (the mathematical mean), a median or anything else? An experienced statistician might get a grasp of the data distribution from these numbers if the data set is relatively small, but for a bigger data set, one should export the data and analyse it with some other tool – or, for instance, pull it into R.

If an aggregate could return a picture, it would be possible to actually see the distribution of the data from within SQL, before exploring it further with other tools.

So the idea is to return a blob from the final function which contains a picture – of a histogram, for example. For that to work, one should use an SQL tool capable of interpreting the blob data as a picture and displaying it (like ifmx-sql-editor, available here). Since all the data needs to be collected first, the init, iter and combine functions are used to collect it, and the final function processes it. Currently it is not possible to create the picture in SPL, so some kind of external library, whether in C or in Java, should be used.

Here is an example of how to create a histogram aggregate using SPL functions and a small Java class (all the code is available here!). In order to collect the data, a LIST data type is used. Each new row adds a new value to the list and passes the list on to the next iteration. This is what the init, iter and combine functions look like:

CREATE FUNCTION histogram_init (dummy INT)
RETURNING LIST(INT NOT NULL);
  RETURN LIST{};
END FUNCTION;

CREATE FUNCTION histogram_iter
  (result LIST(INT NOT NULL), value INT)
  RETURNING LIST(INT NOT NULL);

  INSERT INTO TABLE (result) VALUES (value::INT);

  RETURN result;
END FUNCTION;

CREATE FUNCTION histogram_combine
(partial1 LIST(INT NOT NULL),
 partial2 LIST(INT NOT NULL))

RETURNING LIST(INT NOT NULL);

   INSERT INTO TABLE (partial1)
      SELECT * FROM TABLE(partial2);

   RETURN partial1;
END FUNCTION;

The Java class resides on the database server and is called from the final function via the SYSTEM statement. The created image is stored on the server, picked up with the fileToBlob function and returned to the user.

Since one query can result in multiple images (if GROUP BY is used), the input data set name and the output image name both need to be unique, and both are provided to the Java class which creates the image. To ensure uniqueness, a pseudo-random number generator is used when creating the file names. Here is a simplified version of the final function:

CREATE FUNCTION histogram_final (final LIST(INT NOT NULL)) 
RETURNING BLOB;

  DEFINE...

  LET rnd = sp_random();
  LET in_file = "in_" || rnd || ".txt";
  LET out_file = "out_" || rnd || ".png";

  FOREACH
    SELECT * INTO value FROM TABLE(final) ORDER BY 1

    SYSTEM "echo " || value || " >> " || in_file; 
  END FOREACH

  SYSTEM "sh Histogram " || in_file || out_file; 

  RETURN filetoBLOB("out_" || rnd || ".png", 'server');
END FUNCTION;

The Java class should have a main method, accept two parameters (the input data file and the output image file) and create the picture. It can use a simple Graphics object to draw and store the image. This is the idea:

import java.awt.Graphics;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class Histogram {

  public static void main(String[] args) throws IOException {
    // read the data from the text file args[0]

    // calculate min, max, frequencies

    // set up the image (the size here is just an example)
    BufferedImage bImg =
      new BufferedImage(800, 600, BufferedImage.TYPE_INT_RGB);
    Graphics g = bImg.createGraphics();

    // draw the histogram lines

    // draw min and max

    // write the image
    ImageIO.write(bImg, "png", new File(args[1]));
  }
}
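
One thing the snippets above do not show: just like in the median example at the beginning of the post, the four histogram_* functions still have to be registered as an aggregate before they can be called (the aggregate name here is simply my choice):

CREATE AGGREGATE histogram WITH
   (INIT = histogram_init,
    ITER = histogram_iter,
    COMBINE = histogram_combine,
    FINAL = histogram_final);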

Here is an example of the call and a screenshot of the SQL editor:

[Screenshot: the histogram aggregate call in the SQL editor and the resulting image]

Similarly to this, other aggregates can be implemented as well: for instance, a box and whisker plot, a violin plot, etc.

All the code mentioned here, as well as the median aggregate, which is currently not implemented in Informix, can be found on my GitHub (direct link: https://github.com/ognjenorel/ifmx-utils/tree/master/aggregates).


ETL for Graph Mining

In the world of data mining and analyses, there is a special set of mining methods and algorithms related to data in graphs. A graph is a mathematical construct which contains vertices (nodes) and edges between them. An edge connects a pair of vertices and can be directed or undirected. As a data structure, the graph is used in various extensions of this simple model. One of them is the “property graph model”, which allows vertices and edges to be uniquely identified, labeled (with one or more labels) and given a set of key-value properties. There are a number of NoSQL graph databases which employ this data model.

Graph analyses provide measures and algorithms that help find information which is otherwise (i.e. using a relational model or a data warehouse) very hard to find. Some examples: finding the most important people (in a social network of some kind), the most important pages (on a web site or collection of sites) or the most important servers (in a network); deciding who or what needs the best protection (from flu, cyber-attacks, ...); finding fake profiles and reviews on the web; cutting costs (the traveling salesman problem), etc. Some of these measures are centrality measures (e.g. betweenness, closeness) or locality measures (e.g. indegree, outdegree). There is also a bunch of algorithms aimed at identifying frequent patterns or subgraphs in a graph, or at spotting interesting subgraphs or anomalies.

So, when I have a collection of data in my favorite relational database, how can I make use of all those graph algorithms? There are several possibilities, such as implementing the algorithms inside the relational database, but my answer is to perform an ETL (extract-transform-load) process which extracts the data from the relational database and creates a graph database from it. Then I can perform whatever analysis I want on this separate graph OLAP system. To do that, I defined a relational-to-graph ETL process which follows a set of 10 rules. These rules define the way the rows and their attributes (with regard to primary and foreign keys) are transformed into vertices, edges and their properties. I will not go into the details of these rules here, but if you are interested, please go to slides 30-40 of my IIUG 2017 presentation (available here) to find the rules explained with examples.

Is there any code? Of course there is :). I implemented the process as a series of stored procedures and a view, which need to be created in the relational database you want to extract the data from. After the calls:

execute procedure create_conversion_statements('~\my_working_dir');
execute procedure create_conversion_scripts('~\my_working_dir');

several scripts are created in the specified directory. One script is an SQL unload-transform script, which needs to be executed in the same relational database to create a bunch of data files (one per table). The other files are Cypher scripts (Cypher is an open language used by several graph database systems, like Neo4j). These scripts can be executed in a graph database of your choice in order to create and populate it with the exported data. Once the data is in the database, you can start exploring it using Cypher or some of the graph algorithms.
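
Just to give a feel for the shape of the unload-transform script (this is an illustration only, not the actual generated code, and customer is a hypothetical table), it is essentially a series of statements that dump each table's data into a delimited file:

-- illustration only: the generated statements also apply the
-- transformation rules, so the real SELECT lists are more involved
UNLOAD TO '~\my_working_dir\customer.unl' DELIMITER '|'
   SELECT * FROM customer;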

All the code is open-sourced on GitHub, available here, under my ifmx-utils project.

The examples and more info can also be found in the IIUG presentation.

The formal description of the algorithm and the science around it can be found in the paper Property Oriented Relational-To-Graph Database Conversion.

 

Hope you’ll find some of this useful.


Been a While

The title of my blog is Informix on my mind, but since I haven’t published anything for some time, one could argue that Informix was not on my mind lately :). Correct, but only to some extent. Most of the time I’ve been busy finishing my doctorate research, which I had procrastinated on for too long, and finally defending my PhD thesis. During this time, from my perspective, only a couple of things worth mentioning happened (which doesn’t mean I’m going to stop writing).

First, I gave a presentation at the IIUG 2017 Conference in Raleigh, NC. In this presentation I covered some of the advantages of graph analyses and argued that a graph structure could be used for OLAP purposes. I also presented the algorithm and my open-source implementation of an ETL process for creating a graph OLAP database from an Informix database. Stay tuned – I’ll cover this in more detail, along with links to the code and the presentation, in a separate post.

Second, and of much more importance to the global Informix community, was the IBM-HCL partnership deal about Informix development and maintenance. It was announced before the IIUG Conference in April, and it produced a lot of different reactions, from “IBM finally killed it” to “this is long-expected good news”. I’m not going to elaborate on the details of the deal since there is plenty of information available (let me google it for you). I was among those who were optimistic about this change, since IBM’s work on Informix was slow, especially in the last couple of years. On top of that, many great people who worked on Informix left IBM or were moved off the product.

Now, it has been nearly half a year since the HCL partnership was introduced and we’re all still waiting for a sign that this works. In this time, HCL has hired some very good, Informix-caring people, and I have high hopes that they will succeed in moving things in the right direction. Yet we have still to see what happens, and I wonder how much longer we should wait.

HCL, the community is here, listening and it’s anxious.

 


New Utility – Ifmx Table Copy

Recently a friend of mine who meets a lot of Informix customers told me he frequently gets asked about copying a table with all the data in it, and he suggested that could be a good idea for a small utility. With the SELECT … INTO permanent_table syntax available in Informix 12.10, there wouldn’t be much sense in creating such a utility, but there’s another request: it would be very nice if the constraints (primary and foreign keys, etc.) were copied as well.

Being a small piece of code, it seemed most reasonable for it to be executable in the database itself, with no need for additional software to run it – every DBA should be able to run some SQL statements. So I wrote a couple of stored procedures (well, functions, to be exact), and these are actually all you need to run it. This little project is also open-sourced, released under the GNU GPL v3 license and hosted on GitHub as ifmx-table-copy. There are two functions you need to create in your database (the links lead to the SQL files): getIdxColumnsForTableCopy and tableCopy.

When tableCopy is executed, it will create a copy of the table with a new name, populate it with the data from the source table, and create the primary key, foreign key, check, not null and unique constraints, as well as the indexes that exist on the source table, on the corresponding columns of the target table. The names of the constraints are created from or concatenated with the target table name (see the details on the project page). The default values that exist on the source table columns are also set on the target table, provided they refer to character types. It is possible to specify a different dbspace in which the table will be created; if no dbspace is given (i.e. the corresponding argument is null), the table is created in the default database dbspace. There is also a parameter that prevents the execution of all the statements that would be executed during the target table creation – instead, the statements are returned from the function and can be examined before the “real” execution.

Execute the tableCopy function with the following arguments to see what will actually be executed on the server:

execute function tablecopy('source_table_name', 'target_table_name', null, 'f');

Execute the following to perform actual structure creation and data copy:

execute function tablecopy('source_table_name', 'target_table_name', 'db_space', 't');

This utility uses SQL syntax available in Informix servers 11.70 and newer. There are some minor modifications that could be made to achieve backward compatibility. For more info on that, please see the project’s readme file.
