
Visual aggregates

There are a number of built-in aggregates in Informix which have been there for what seems like forever and haven’t changed through many versions now. If you’re using SQL, you surely use aggregates – COUNT, SUM, AVG, to name just a few. And if you want something more, there is a way to create your own aggregate. Only four functions need to be implemented, which can be done in SPL, Java or C. These functions are:

  • init – initialises the data structure needed to calculate the aggregate and states the data types the aggregate can be used on; called once at the beginning of aggregate execution
  • iter – called once for each row in the dataset; adds the information brought by that row to the current result value of the aggregate and passes it on to the next iteration
  • combine – called if the aggregate is running in parallel, to combine the current results of two threads
  • final – called at the end of the aggregate calculation; returns the final result of the aggregate.

After that, one simple SQL statement creates the aggregate. For instance, if we want to create a median aggregate, we first implement the median_init, median_iter, median_combine and median_final functions and execute:

CREATE AGGREGATE median WITH
   (INIT = median_init, 
    ITER = median_iter,
    COMBINE = median_combine,
    FINAL = median_final);

and it is ready to be used in SQL:

SELECT employee_id, median (salary) FROM income GROUP BY 1;

There is a nice post from Andrew Ford on how to create aggregates; take a look at it: User Defined Aggregates.
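To give a feeling for what these functions involve, here is a minimal, hypothetical sketch of a median_final function in SPL. It assumes the init/iter/combine functions simply collect all the values into a LIST (the same approach the histogram example below uses); the actual median implementation can be found in the GitHub repository linked at the end of this post:

CREATE FUNCTION median_final (final LIST(DECIMAL NOT NULL))
RETURNING DECIMAL;

  -- number of collected values and a running position counter
  DEFINE n, i INT;
  -- current value and the two middle values (equal when n is odd)
  DEFINE v, lower_mid, upper_mid DECIMAL;

  LET n = CARDINALITY(final);
  LET i = 0;

  -- walk the collected values in sorted order and remember the middle one(s)
  FOREACH
    SELECT * INTO v FROM TABLE(final) ORDER BY 1

    LET i = i + 1;
    IF i = TRUNC((n + 1) / 2) THEN
      LET lower_mid = v;
    END IF;
    IF i = TRUNC(n / 2) + 1 THEN
      LET upper_mid = v;
    END IF;
  END FOREACH

  RETURN (lower_mid + upper_mid) / 2;
END FUNCTION;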

The limitation of an aggregate in general, from a data research point of view, is that it returns a single value, which describes the selected dataset in a particular way. If we want to find out something about the dataset, what does it mean to get only one number that sums it up – be it a count, an average (mathematical mean), a median or anything else? An experienced statistician might get a grasp of the data distribution from these numbers if the dataset is relatively small, but for a bigger dataset, one would have to export the data and analyse it with some other tool. Or, for instance, pull it into R.

If an aggregate could return a picture, it would be possible to actually see the distribution of the data from within SQL, before exploring it further with other tools.

So the idea is to return a blob from the final function which contains the picture of, for example, a histogram. For that to work, one should use an SQL tool capable of interpreting the blob data as a picture and displaying it (like ifmx-sql-editor, available here). Since all the data needs to be collected first, the init, iter and combine functions should be used to collect it, and the final function should process it. Currently, it is not possible to create the picture in SPL, so some kind of external library, whether in C or in Java, should be used.

Here is an example of how to create a histogram aggregate using SPL functions and a small Java class (all the code is available here!). In order to collect the data, a list data type is used. Each new row adds a new value to the list and passes the list on to the next iteration. This is what the init, iter and combine functions look like:

CREATE FUNCTION histogram_init (dummy INT)
RETURNING LIST(INT NOT NULL);
  RETURN LIST{};
END FUNCTION;

CREATE FUNCTION histogram_iter
  (result LIST(INT NOT NULL), value INT)
  RETURNING LIST(INT NOT NULL);

  INSERT INTO TABLE (result) VALUES (value::INT);

  RETURN result;
END FUNCTION;

CREATE FUNCTION histogram_combine
(partial1 LIST(INT NOT NULL),
 partial2 LIST(INT NOT NULL))

RETURNING LIST(INT NOT NULL);

   INSERT INTO TABLE (partial1)
      SELECT * FROM TABLE(partial2);

   RETURN partial1;
END FUNCTION;

The Java class resides on the database server and is called from the final function via the SYSTEM statement. The created image is written on the server, read back with the filetoBLOB function and returned to the user.

Since one query can result in multiple images being returned (if GROUP BY is used), the input data file name and the output image name both need to be unique, and both are provided to the Java class which creates the image. To ensure uniqueness, a pseudo-random number generator is used when creating the file names. Here is a simplified version of the final function:

CREATE FUNCTION histogram_final (final LIST(INT NOT NULL)) 
RETURNING BLOB;

  DEFINE...

  LET rnd = sp_random();
  LET in_file = "in_" || rnd || ".txt";
  LET out_file = "out_" || rnd || ".png";

  FOREACH
    SELECT * INTO value FROM TABLE(final) ORDER BY 1

    SYSTEM "echo " || value || " >> " || in_file; 
  END FOREACH

  SYSTEM "sh Histogram " || in_file || " " || out_file; 

  RETURN filetoBLOB(out_file, 'server');
END FUNCTION;
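
With all four functions in place, the aggregate itself is registered the same way as the median example above (the name histogram is simply a choice here):

CREATE AGGREGATE histogram WITH
   (INIT = histogram_init,
    ITER = histogram_iter,
    COMBINE = histogram_combine,
    FINAL = histogram_final);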

The Java class should have a main method, accept two parameters (the input and output file names) and create the picture. It can use a simple Graphics object to create and store the picture. This is the idea:

import java.awt.Graphics;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class Histogram {

  public static void main(String[] args) throws Exception {
    // read the data from the text file args[0]

    // calculate min, max, frequencies

    // set up the image
    BufferedImage bImg = ...
    Graphics g = bImg.createGraphics();

    // draw the histogram lines

    // draw min and max

    // write the image to args[1]
    ImageIO.write(bImg, "png", new File(args[1]));
  }
}

Here is an example of the call and a screenshot of the SQL editor:
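
The call itself looks like any other aggregate call. For illustration (reusing the income table from the median example above; the exact query in the screenshot may have been different), a single histogram over all salaries could be produced with:

SELECT histogram(salary) FROM income;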

[Screenshot: the SQL editor displaying the query result with the returned histogram image]

Similarly, other aggregates can be implemented as well – a box-and-whisker plot, a violin plot, etc.

All the code mentioned here, as well as the median aggregate which is currently not implemented in Informix, can be found at my GitHub (direct link: https://github.com/ognjenorel/ifmx-utils/tree/master/aggregates).


Been a While

The title of my blog is Informix on my mind, but since I haven’t published anything for some time, one could argue that Informix was not on my mind lately :). Correct, but only to some extent. Most of the time I’ve been busy finishing my doctoral research, which I had procrastinated on for too long, and finally defending my PhD thesis. During this time, from my perspective, only a couple of things worth mentioning happened (which doesn’t mean I’m going to stop writing).

First, I gave a presentation at the IIUG 2017 Conference in Raleigh, NC. In this presentation I covered some of the advantages of graph analysis and argued that a graph structure could be used for OLAP purposes. I also presented the algorithm and my open-source implementation of an ETL process for creating a graph OLAP database from an Informix database. Stay tuned, I’ll cover this in more detail, along with links to the code and presentation, in a separate post.

Second, and of much more importance to the global Informix community, was the IBM-HCL partnership deal covering Informix development and maintenance. It was announced before the IIUG Conference in April, and produced a lot of different reactions, from “IBM finally killed it” to “this is long-expected good news”. I’m not going to elaborate on the details of the deal since there is plenty of information available (let me google it for you). I was among those who were optimistic about this change, since IBM’s work on Informix was slow, especially in the last couple of years. On top of that, many great people who worked on Informix left IBM or were moved away from the product.

Now, it has been nearly half a year since the HCL partnership was introduced and we’re all still waiting for a sign that it works. In this time, HCL has hired some very good people who care about Informix, and I have high hopes that they will succeed in moving things in the right direction. Yet we have still to see what happens, and I wonder how much longer we should wait.

HCL, the community is here, listening, and it’s anxious.

 


IIUG Conference 2015

The IIUG 2015 Conference in San Diego is getting very close. Next week, throughout the conference with over 80 presentations, keynotes and other events, the IIUG will celebrate its 20th birthday! And following the “conference for users, by users” policy, I will present some of the tools (mostly open-source) and techniques my team uses in daily work with Informix. It will be a run through different software, from scripts and SQL tools to the application development and testing techniques we use in the development, maintenance and support of big information systems. Though some of the stuff will certainly be familiar to a part of the audience, I hope everyone gets to hear and learn about something new.

Find the info about the conference and the schedule at www.iiug2015.org. For developers, it should be especially pleasing to see a number of developer-oriented topics – Java, Hibernate, web security, etc. – just follow the D track.

See you at IIUG 2015!


Continuing Strong in 2015

The Gartner guys reacted to a recent analyst report which stated that in 2015 the industry would begin retiring Sybase and Informix products. We’ve already seen so many such reports stating Informix is dead, or will be in the near future, that I’m used to ignoring all of them, but it’s nice to see Gartner investigate and state the opposite.

Here is the link to the article (first there is a part on Sybase, and Informix follows). What gives additional warmth to the heart is the recognition of the IIUG as one of the important things when talking about Informix. Kudos, Donald and Merv!


What Does It Mean for Us?

The big announcement in the world of Informix the other day was IBM signing an agreement with the Chinese company General Data Technology, or GBASE, which is, based on information on their website, a key software enterprise in the state planning. The essence of the agreement (a copy of the news is here) is that IBM shares the code of Informix, and GBASE will (in support of the Chinese government’s national technology agenda) modify its security to conform to the standards of the Chinese government. That modified Informix will then be used in future database projects.

So this is obviously big news for Informix, because it guarantees the expansion of the product to a fast-growing market, but what exactly does it mean for us, non-IBMers involved in the Informix business?
Well, I’m looking at this in good faith, so my personal thoughts are positive. This kind of agreement will result in bigger demand for Informix experts, both domestic and foreign. I have to admit that I have no idea about the state of affairs regarding Informix experts in China, but I presume there will be a need for database designers and architects, and an even bigger need for DBAs to maintain the big installations we all imagine when thinking of databases in China. So it’s quite possible some of us will find our careers continuing somewhere in East Asia or telecommuting for a local partner in China. Other than that, new DBAs should be properly trained, so that’s another opportunity for all of us involved in teaching and training. This also means the growth of the Informix community, and hopefully there will be more international community members springing out of these systems.

But most importantly, this means the long-term survival of Informix as a product, which is of course in the best interest of all the Informix people. As Mr. Art Kagel said, this is proof Informix is here to stay. And a little dream for the end, to share with you… it would be nice to have this kind of commitment in other countries as well. Just saying.


The Book: Data Just Right

I realized there are a number of books covering databases and data handling that I go through or am in touch with, but I never mention any of them or give any credit to the authors, so I’m gonna change that starting today.

Recently I came across the book Data Just Right by Michael Manoochehri, subtitled Introduction to Large-Scale Data & Analytics. Books with this kind of name could hide anything, quite often restricting themselves to one or two technologies, but that is not the case with this one. It is a review of the current state of data management and analytics, with a lot of sense for data management history and for current needs and trends in this field. Reading this book won’t teach you how to use Hadoop, Pig, R or anything like it. It will give you a perspective on the various technologies used today, show some examples and try to help you find the right tools for your needs.

What I found interesting about it is the breadth of technologies and ideas being covered. In the book, especially in the opening chapters, there are so many products, languages, tools, names and methodologies mentioned that only a select few data experts could know about all of them. Codd, OLAP, NewSQL, BigQuery, SOX, Tableau, SciPy, to name just a few. For a book of only 200 pages, there is an index of more than 1200 entries in it. So, in my humble opinion, this is why this book is worth going through – it gives a good perspective on data technologies to any kind of reader: data management novice, expert, CIO, CTO. At the same time, this is a burden for the book, because it will require some changes in the following editions to stay current in a fast-changing data management and analytics landscape. This first edition is certainly worth reading.

More info about the book on its website: datajustright.com.


Working with JSON Data from SQL

MongoDB support was introduced in 12.10.xC2, bringing many cool things to Informix, one of them being the JSON and BSON data types. Putting all the NoSQL and MongoDB story aside, these new data types enable us to work with semi-structured data directly from SQL, thanks to several new built-in functions. Of course, you could do the same with XML documents, but it took a while before all the necessary functions became available in Informix, and working with XML is still more complex than working with JSON, because of the differences between those two data formats.

In order to put the data in a JSON column you can use the genBSON function, or simply cast text to the JSON type. Here’s an example – a tourist office database table storing various places people could visit. One table with JSON data could be used to store data about many different places – cities, regions, islands, landmarks, etc. So the table could be defined as:

CREATE TABLE places (
  place_id SERIAL,
  numberOfVisitorsPerYear INT,
  place BSON
);

The place column could also be of JSON type, but if you want to perform more meaningful queries on a table, stick to BSON. There are some BSON functions we can use, and BSON can be cast to JSON data.

Rows could be inserted via plain SQL:

INSERT INTO places VALUES (0, 500000, '{city: "Zagreb"}'::JSON);

Note that the last value needs to be cast to JSON in order to be able to run queries on it with the bson_value functions. Here are some other data rows with various attributes describing the places:

INSERT INTO places VALUES (0, 600000, '{city: "Pula", country: "Croatia", population: 57000}'::JSON);
INSERT INTO places VALUES (0, 20000, '{mountain: "Velebit", country: "Croatia", height: 1757}'::JSON);
INSERT INTO places VALUES (0, 1000000, '{national_park: "Plitvice", country: "Croatia"}'::JSON);

The simplest way to find out what is stored in the table is to execute a query like this one:

SELECT place_id, numberOfVisitorsPerYear, place::JSON
FROM places;

which will return these results:

place_id  numberOfVisitorsPerYear  (expression)
1         500000                   {"city":"Zagreb"}
2         600000                   {"city":"Pula","country":"Croatia","population":57000}
3         20000                    {"mountain":"Velebit","country":"Croatia","height":1757}
4         1000000                  {"national_park":"Plitvice","country":"Croatia"}

However, the idea is to be able to search within JSON data. For that purpose, there are some new functions we can use:

  • bson_value_int(column, bson_attribute) – returns an integer value of the bson_attribute stored in a specified column of the row
  • bson_value_bigint(column, bson_attribute) – returns a bigint value of the bson_attribute stored in a specified column of the row
  • bson_value_double(column, bson_attribute) – returns a double value of the bson_attribute stored in a specified column of the row
  • bson_value_lvarchar(column, bson_attribute) – returns an lvarchar value of the bson_attribute stored in a specified column of the row

Here are some query examples:

-- find all destinations in Croatia 
SELECT *, place::JSON FROM places 
WHERE bson_value_lvarchar (place, 'country') = 'Croatia'; 

-- find all destinations without a specified country 
SELECT *, place::JSON FROM places 
WHERE bson_value_lvarchar (place, 'country') IS NULL; 

-- find all mountains higher than 1000 meters 
SELECT *, place::JSON FROM places 
WHERE bson_value_lvarchar (place, 'mountain') IS NOT NULL 
AND bson_value_int (place, 'height') > 1000; 

-- find all national parks in Croatia 
SELECT *, place::JSON FROM places 
WHERE bson_value_lvarchar (place, 'country') = 'Croatia' 
AND bson_value_lvarchar (place, 'national_park') IS NOT NULL; 

There is another new function, genBSON, which generates BSON/JSON data from a relational table’s data, depending on a query. It can be used to return JSON directly from a query or to insert JSON data into a column. The Informix Knowledge Center page for this function is informative, along with some examples, so I’m not going to repeat it all here. As a continuation of our example, if a tourist office already has a relational table named cities in its database, then this data could be imported into the places table with a single SQL statement:

-- cities relational table 
CREATE TABLE cities (
  city_id SERIAL, 
  city CHAR(30), 
  population INT, 
  country CHAR(30), 
  numberOfVisitorsPerYear INT
); 

-- copy the cities data into places table: 
INSERT INTO places 
SELECT 0, numberOfVisitorsPerYear, genBSON(ROW(city, country, population), 0, 1)::JSON
FROM cities; 

Or, if we don’t want to have a copy of cities in places, a view on structured and semi-structured data could be made (this one returning only JSON data):

CREATE VIEW places_and_cities (place) AS 
  SELECT place::JSON FROM places 
  UNION 
  SELECT genBSON(ROW(city, country, population), 0, 1)::JSON FROM cities; 
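
The view can then be queried like any other table. As a minimal, illustrative query against the example data defined above:

-- list all places, whether they come from the JSON column or from the relational cities table
SELECT place FROM places_and_cities;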

In conclusion, with JSON capabilities at hand, it’s now pretty simple to mix structured and semi-structured data in a single database. But before we do it, we should make sure there is a proper need to design our data model that way, bearing in mind there are numerous advantages of having relational data in a relational database.
