Planning to spend the weekend reading/watching H2O stuff:

I intend to use the latest major version of H2O (3.x), which was recently released.


It might be a little hasty of me to try and figure out the future of this blog given that I restarted posting here only two days ago, but there are some things that I find restrictive about sites:

  • the inability to have a full \LaTeX implementation. The default \LaTeX support comes with only the amsmath, amsfonts and amssymb packages installed, and that is not likely to be enough in the long run. On the other hand, WP QuickLaTeX, for example, provides fully featured \LaTeX, including support for TikZ and pgfplots, which I find extremely impressive.
    In addition, because of the multi-user (multi-tenancy) nature of WP blogs, no custom JavaScript can be part of a blog. Among other things, this means that MathJax for rendering math is not an option either;
  • for the same reason as above, that is, the lack of custom JavaScript on blogs, it is not possible to use Google Analytics tracking on websites.

So, over the next couple of days, I will be evaluating a few options for moving this site back to self-hosting, including:

  • GoDaddy
  • FatCow
  • Bluehost
  • dedicated AWS instance

I am not a big fan of the hosting providers because I always feel confused by their hosting plans and the control panels that they offer to customers; it is like learning a whole new system just to be able to host a website with that provider. The other thing that worries me about hosting services like GoDaddy is that their business practices are not always transparent: they tend to have hidden costs and auto-renewals built in, as well as clauses that do not make it easy to point domain names at sites hosted elsewhere. This might be FUD, but dispelling it would require a substantial investment in reading up on all the pitfalls of hosting services like GoDaddy.

However, given that all I plan to do is host a WordPress blog, I feel that going the AWS route might be too costly. I also know that if I go the AWS route, I will be tempted to move the entire site to Django or Flask, and that is not a time and effort investment I want to make right now. Even so, setting up and maintaining a WordPress blog on AWS takes some work, and is not free of cost. There is the option of reserving an instance for a year at $0.009 an hour, but that would be a t2.micro instance, and I am not sure it would be up to the task of serving a website. There are options for getting rid of a reserved instance midway through its term, and it is possible to upgrade an instance if the load becomes too high, but both of those require work.
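As a rough sanity check on that price (assuming the quoted $0.009/hour figure and ignoring any upfront reservation fee, which varies by plan), the annual cost works out to:

```python
hourly_rate = 0.009          # quoted reserved t2.micro price, $/hour
hours_per_year = 24 * 365    # ignoring leap years

annual_cost = hourly_rate * hours_per_year
print(f"~${annual_cost:.2f} per year")  # roughly $79 per year
```

So the hourly rate alone is cheap; the real costs are the setup and maintenance work mentioned above, plus whatever upfront fee the reservation carries.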

So I think I will sleep on this decision for a few days. I am not going to make a decision until either GoDaddy or FatCow sends me an email discount coupon anyway, which they do without fail every month.

This blog was previously called Large Deviations, and was self-hosted. I also like Information Matrix and Random Measures as names for a blog that is largely focused on statistics and data science.

Not being a gamer, I was not aware that some gaming keyboards, in particular my recently acquired Cooler Master QuickFire Rapid-i, have a Windows key lock that disables the Windows key so that pressing it accidentally does not interrupt a game. I use the Windows key quite frequently on Ubuntu to emulate the Windows “Snap to Sides” feature, which maps Ctrl+Win+RightArrow to snap a window to the right half of the screen, for example, and I was not able to figure out why it was not working with my new keyboard. It turns out the Windows key lock is on by default, and can be turned off using Fn+PrtSc.

I spent some time trying to figure out whether I could get Google Analytics to track the traffic on my account, but it seems that this is only available on the WordPress Business plan, or if I migrate the site to

I also spent some time figuring out how to set up a new account and property within a Google Analytics account. And then there is the question of configuring useful views of the data within a property (Account > Property > Views). For now, I have created two accounts within my GA account, one for my main site and the other for all the blogs I have on services such as tumblr and this site on

I googled and put together a lightly curated set of links that I want to read to help me build a strategy for preparing for data science interviews.

Since I will be answering questions on programming using Python, here are some links on Python programming interview questions:

I must admit that I don’t really like the look of some of these links: they seem too focused on syntax, or on clever and compact ways of achieving trivial things, rather than on the big picture of the language, the architecture of applications, or actually solving difficult problems. In addition, they don’t cover some of the core packages that a data scientist should know, such as NumPy, SciPy or scikit-learn, and to a lesser extent pandas and statsmodels.
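To illustrate the kind of package-level knowledge I mean (a hypothetical example of my own, not taken from any of those links): standardizing the columns of a matrix with NumPy broadcasting, rather than explicit Python loops, is the sort of idiom a data science interview could reasonably probe.

```python
import numpy as np

# A small matrix: three observations of two variables.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Vectorized standardization: subtract each column's mean and divide by
# its standard deviation, relying on broadcasting instead of loops.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)
```

After this, each column has mean 0 and standard deviation 1, and no explicit loop was needed.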

For reference, here is my implementation of FizzBuzz.

for x in range(1, 101):
    if x % 15 == 0:
        print("FizzBuzz")
    elif x % 3 == 0:
        print("Fizz")
    elif x % 5 == 0:
        print("Buzz")
    else:
        print(x)

This is the vanilla implementation that can be found all over the internet; the trick, really, is to check divisibility by 15 first rather than last, even though it reads last in the question. Everything else is standard syntax. At some point I implemented it using generators, but I can’t find that solution, and I don’t want to get distracted from pulling together links for my interview prep.
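A generator-based variant (a fresh sketch, not the lost solution) might look like this, yielding the values lazily instead of printing them:

```python
def fizzbuzz(n=100):
    """Lazily yield FizzBuzz values for 1..n as strings."""
    for x in range(1, n + 1):
        if x % 15 == 0:
            yield "FizzBuzz"
        elif x % 3 == 0:
            yield "Fizz"
        elif x % 5 == 0:
            yield "Buzz"
        else:
            yield str(x)

print(", ".join(fizzbuzz(15)))
```

The nice thing about the generator form is that the logic is decoupled from the output: the caller can print, join, or slice the sequence as needed.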

Over the next couple of days I intend to work through these links, and I will see if I can distil them into a core strategy that I can use to prepare and review before interviews. I also intend to blog about some of the peripheral aspects of a job search, such as writing a good resume and cover letter, interview etiquette, and searching for relevant jobs.

Spark is the flavour of the season when it comes to distributed computing frameworks, and I have been caught up in the excitement. I have narrowed down the set of Spark resources to three, which I am going to use over the next three months to try to learn Spark:

  • The first is Advanced Analytics with Spark, which is hot off the presses, but has already gotten very good reviews on Amazon. Not surprising given that Sean Owen is one of the authors.
  • The second, of which I have already read the first few chapters, but intend to systematically re-read over the next month, is Learning Spark. I must admit, I am somewhat intrigued by how much Matei Zaharia has managed to achieve at (what I am guessing is) a relatively young age.
  • Lastly, there is a brand new course on Spark that has just started. It is a little disappointing that the course uses Python 2.7 and Spark 1.3, instead of moving to the impending release of Spark 1.4, which also supports SparkR and Python 3.4. I think the Spark developers are holding off on releasing the latest version until the Spark Summit later this month.

In any case, Spark is an important new technology for data analysis, and a significant improvement over the disk-only storage model of Hadoop MapReduce. That is not to say that Spark is the only in-memory distributed computing framework that can do ad hoc querying, machine learning, and graph processing on big data; there is Apache Flink, newly promoted to a top-level project, but Spark definitely appears to have a significant head start. I look forward to learning more about Apache Spark.

It has been a long time since I blogged here, so here is a quick update:

  • Coffee is still as important to me as it was, but I no longer forget to consume it as part of my daily routine.
  • I still struggle with typing. I have tried several different keyboards over the past 3 years, including the Microsoft Ergonomic 4000 (of which I have gone through 3), the Cooler Master Rapid-i tenkeyless, which I got recently, and the Logitech K400r.
  • I have been reminded of Julia sporadically over the years, as people experiment with it and as the tooling and support for Julia have gotten better. The second JuliaCon is going to be held in Cambridge, Mass. this month. I have not experimented with Julia since the last blog post mentioning it was written on this blog.
  • I still struggle with the pros and cons of multiple computing devices, and over the last 3 years, smartphones have only added to that struggle. I now own 4 (!) laptops, 2 mobile phones, and a (defunct) desktop PC. 3 of the 4 laptops now run Ubuntu (14.04, 14.10 and 15.04 (!)), and the other one runs Windows 7. One of my phones runs Android Lollipop, and the other runs Windows 8.1 (looking forward to the Windows 10 update).
    I constantly forget which machines I have updated and which ones contain a particular piece of software. I have myriad unwritten rules about which machines are to be used for what, but most of those are compromised in favor of just doing what is most convenient.
  • I did buy the domain and set up a new blog on it, but that blog never contained anything interesting at all, and I eventually shut it down, though not before it auto-renewed for a year and I vowed to use it for something productive, which I never did.
  • I never got anywhere with OCaml or F# or any of the other functional programming languages that I wanted to learn at that point. I don’t think I ever tried after writing that blog post.
  • I did end up investing a fair amount of effort in learning how to create, and in maintaining, a couple of R and Python package projects. They are not terribly complex, but it was helpful to understand the process of package creation, documentation, and distribution. Those looking to create R packages, even though RStudio makes packaging R code extremely simple, might benefit from Hadley Wickham’s excellent new book R Packages.
  • I completely abandoned trying to work with Matlab or Octave once I discovered NumPy. I even thought about translating the best-known example of the use of Matlab, Andrew Ng’s machine learning course on Coursera, into NumPy and SciPy, but have not gotten around to it yet.
  • I would say that I am better at R than I used to be. I will leave it at that. I did manage to do a lot of work that integrates C++ code with R using Rcpp.
  • I never took up the work on understanding reinforcement learning and dynamic programming, even though I have had problems that I could potentially solve using those methods.