Saturday, November 17, 2012

Coursera Scala Course

I recently took the Scala course on Coursera and thoroughly enjoyed it. It is the first time I've completed an online course - I usually slacken off. The combination of it being interesting and having friends, like Greg, doing it too was what pushed me to complete all of the assignments.

Overall the course was well structured and well run - a credit to Martin Odersky and his staff. If I could do it again, I'd probably try to watch more of the videos! Finding the time is always a struggle.

All said and done, I'm still not sure how I feel about Scala. You can create some very concise code in it, but I also find I can easily get tied in knots. Still, I think it is good enough to try out on a small project before I make any firm judgements.

It would be great to do something similar in Clojure. I've dabbled here and there, but I'm a long way from being proficient. Next year, maybe.

Thursday, November 1, 2012

Performance Analysis with Logscape

At work we have built an object-based query model that works directly off our domain objects. It is quite neat in that when we add new fields or objects to our domain, we can automatically construct queries against them. The underlying data is stored in an Oracle Coherence cluster - there is a lot of data!
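
To give a flavour of the idea, here is a minimal, hypothetical sketch - the names are illustrative, not our actual API - of a query condition driven by a domain object's fields via reflection, so any newly added field is automatically queryable:

import java.lang.reflect.Method;

public class FieldEqualsQuery {

    private final String fieldName; // e.g. "currency"
    private final Object expected;  // e.g. "USD"

    public FieldEqualsQuery(String fieldName, Object expected) {
        this.fieldName = fieldName;
        this.expected = expected;
    }

    /** Reads the field via its getter and compares it to the expected value. */
    public boolean matches(Object domainObject) throws Exception {
        String getter = "get" + Character.toUpperCase(fieldName.charAt(0))
                + fieldName.substring(1);
        Method method = domainObject.getClass().getMethod(getter);
        return expected.equals(method.invoke(domainObject));
    }
}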

As with any new method of querying, at some point you need to understand the performance of the various components. As we are using Coherence, getting down to the nitty-gritty of where time is being spent can be difficult. For example, a query is composed on a client machine, serialized and sent to an extend proxy, interpreted into something Coherence will understand, broken up into pages, run against the storage nodes, aggregated back on the extend node, populated with extra data, and finally sent back to the client - phew! A single query will span 20 machines and 200+ JVMs.

To collect some performance metrics for this, we decided to dump instrumentation data to our log files. We settled on the following pipe-delimited format:

INFO  instrumentation - |8427b9bb-b371-4d2b-9383-c7784456068b|de64d0bb-3f16-41ba-a50d-598fe89ca313|0|METRIC|StorageNodeAggregationTime|0
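
Producing a line in that format is just an ordinary log statement. The sketch below assumes SLF4J and a logger named "instrumentation"; the meaning of the fields beyond queryId, the METRIC marker, the metric name, and the time taken is an assumption for this example:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class InstrumentationLog {

    // Logger named "instrumentation" so the lines are easy to pick out (assumption)
    private static final Logger LOG = LoggerFactory.getLogger("instrumentation");

    /** Writes one pipe-delimited metric line, e.g. |queryId|subQueryId|0|METRIC|name|12 */
    public static void metric(String queryId, String subQueryId, long sequence,
                              String metricName, long tookMillis) {
        LOG.info("|{}|{}|{}|METRIC|{}|{}",
                queryId, subQueryId, sequence, metricName, tookMillis);
    }
}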
   
Why pipe-delimited? We use Logscape for searching our data, and it can easily break pipe-delimited output up into sensible fields. To split the data into fields we use the method described here. Specifically, we created synthetic fields for queryId, instrumentation, metric, and took using option 2:

split,\|,1
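
That single split is doing roughly the equivalent of the following - plain Java, not Logscape, with the field positions read off the example line above:

public class SplitExample {

    public static void main(String[] args) {
        String line = "INFO  instrumentation - |8427b9bb-b371-4d2b-9383-c7784456068b"
                + "|de64d0bb-3f16-41ba-a50d-598fe89ca313|0|METRIC|StorageNodeAggregationTime|0";

        // Split on '|'; -1 keeps any trailing empty fields
        String[] fields = line.split("\\|", -1);

        String queryId = fields[1];         // first pipe-delimited field
        String instrumentation = fields[4]; // METRIC (or Artifact, which we filter out later)
        String metric = fields[5];          // the metric name
        String took = fields[6];            // time taken in milliseconds

        System.out.println(metric + " for query " + queryId + " took " + took + "ms");
    }
}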

This is very fast, especially when you consider Logscape is running and indexing on all 20 machines in the cluster.

Now that we have this set up, we can run queries like this:

type='fabric' instrumentation | instrumentation.not(Artifact) took.avg(metric) chart(table) buckets(1)

The above query searches across all log files mapped to the fabric data type. It first matches lines containing the word 'instrumentation', then filters out all of the lines where the instrumentation field is Artifact (we only care about METRIC), and finally averages the took field for each metric. All of this produces a table like the one below.


This is the average time spent in each stage across all queries that have run on our cluster (times are in milliseconds). Whilst this is useful, we usually just want to see a sum of the times for a single query. We achieve that with a slightly different search:


type='fabric' instrumentation | queryId.equals({0}) instrumentation.not(Artifact)  took.sum(metric) chart(table) buckets(1)

This query introduces a variable, {0}, which we can either fill in via the variables text field in Logscape or have filled in automatically via a generated link that opens the search. Instead of avg we use sum, as we want to see the total timings for a particular query. We end up with another table:

So we can see that a lot of time is spent populating dimensions - we can now go and make some changes to the code, run the query again, and hopefully get some better figures. Anyway, go check out Logscape; it is a great bit of software for analyzing large amounts of machine data spread out across many servers.