Spark is the flavour of the season when it comes to distributed computing frameworks, and I have been caught up in the excitement. I have narrowed down the set of Spark resources to three, which I am going to try and use over the next 3 months to try and learn Spark:
- The first is Advanced Analytics with Spark, which is hot off the presses, but has already gotten very good reviews on Amazon. Not surprising given that Sean Owen is one of the authors.
- The second, of which I have already the first few chapters, but intend to systematically re-read over the next month, is Learning Spark. I must admit, I am somewhat intrigued by how much Matei Zaharia has managed to achieve at (what I am guessing is) a relatively young age.
- Lastly, there is a brand new edX.org course on Spark that has just started. It is a little disappointing that the course uses Python 2.7 and Spark 1.3, instead of moving to the impending release of Spark 1.4, which also supports SparkR and Python 3.4. I think that the Spark guys are holding off on releasing the latest version of Spark till the Spark summit later this month.
In any case, Spark is an important new technology for data analysis, and significant improvement over the disk-only storage model of Hadoop MapReduce. That is not to say that Spark is the only in-memory distributed computing framework that can do ad hoc querying, machine learning, and graph processing on big data — there is the newly promoted to top-level project Apache Flink, but Spark definitely appears to have a significant head start. I look forward to learning more about Apache Spark.
It has been a long time since I blogged here, so here is a quick update:
- Coffee is still as important to me as it was, but I no longer forget to consume it as part of my daily routine.
- I still struggle with typing. I have tried several different keyboards over the past 3 years, including the Microsoft Ergonomic 4000 (of which I have gone through 3), the Cooler Master Rapid-i tenkeyless, which I got recently, and the Logitech K400r.
- I have been constantly reminded of Julia sporadically over the years, as people experiment with it, and as the tooling and support for Julia has gotten better over. The second JuliaCon is going to be held in Cambridge, Mass this month. I have not experimented with Julia since the time the last blog post mentioning Julia was written on this blog.
- I still struggle with the pros and cons of multiple computing devices, and over the last 3 years, smartphones have only added to that struggle. I now own 4 (!) laptops, 2 mobile phones, and a (defunct) desktop PC. 3 of the 4 laptops now run Ubuntu (14.04, 14.10 and 15.04 (!)), and the other one runs Windows 7. One of my phones runs Android Lollipop, and the other runs Windows 8.1 (looking forward to the Windows 10 update).
I constantly forget which machines I have updated and which ones contain a particular piece of software. I have myriad unwritten rules about which machines are to be used for what, but most of those are compromised in favor of just doing what is most convenient.
- I did buy the domain, and set up a new blog on that page, but that blog never contained anything interesting at all, and I eventually shut it down, but not before it was auto-renewed for a year, and I vowed to use it for something productive, but didn’t.
- I never got anywhere with OCaml or F# or any of the other functional programming languages that I wanted to learn at that point. I don’t think I ever tried after writing that blog post.
- I did end up investing a fair amount of effort in learning how to, and maintaining a couple of R and Python package projects. They are not terribly complex, but it was helpful to understand the process of package creation, documentation, and distribution. For those who are looking to create R packages, even though RStudio makes packaging R code extremely simple, might benefit from Hadley Wickham’s excellent new book R Packages.
- I completely abandoned trying to work with Matlab or Octave since I discovered NumPy and I even thought about translating the best known example of the use of Matlab — Andrew Ng’s machine learning course on Coursera — using NumPy and SciPy, but have not gotten around to it yet.
- I would say that I am better at R than I used to be. I will leave it at that. I did manage to do a lot of work that integrates C++ code with R using Rcpp.
- I never took up the work on understanding reinforcement learning and dynamic programming, even though I have had problems that I could potentially solve using those methods.
I have some idea of dynamic programming problems, based on my graduate level macroeconomics courses. But what I am trying to figure out are the kinds of machine learning problems that DP is most effective in providing insights into, and the kinds of insights that DP can provide.
I aim to collect some simple examples over the next few weeks that demonstrates the things that you can get out of setting up problems as reinforcement learning or DP problems. Stay tuned.
The similarity of the Julia programming language to Matlab and its syntax makes it very easy to translate simple Matlab programs into Julia code. The following code shows simulating linear regression parameter and standard error estimation in Julia and Matlab. The superficial similarity of the code is remarkable.
First, here is the Matlab code:
vY = randn(100, 1); % outcome variable
mX = randn(100, 4); % design matrix
iN = size(vY, 1); % sample size
vBeta = (vY\mX)'; % estimated coefficients
vE = vY - mX*vBeta; % residuals
dSigmaSq = vE'*vE/iN; % residual variance
mV = dSigmaSq.*(inv(mX'*mX)); % covariance matrix
vStdErr = diag(mV); % std. err.
vT = (sqrt(iN)*vBeta)./vStdErr; % t-statistics
[vBeta, vStdErr, vT]
and here is the Julia code
vY = randn(100, 1); # outcome variable
mX = randn(100, 4); # design matrix
iN = size(vY, 1); # sample size
vBeta = (vY\mX)'; # estimated coefficients
vE = vY - mX*vBeta; # residuals
dSigmaSq = vE'*vE/iN; # residual variance
mV = dSigmaSq[1,1].*(inv(mX'*mX)); # covariance matrix; dSigmaSq.*
vStdErr = diag(mV) # std. err.
vT = (sqrt(iN).*vBeta)./vStdErr # t-statistics
println([vBeta'; vStdErr'; vT']')
Note that Matlab knows how to print matrices without a call to the
println function. The main difference here is that Julia does not know that a 1×1 matrix is a scalar and issues a matrix multiplication conformability error, whereas Matlab simply switches to elementwise multiplication which is the mathematically justifiable default.
I have always thought that maintaining a code library is a great way to keep up your chops in a language you do not use everyday. A lot of people use listservs and coding fora for this, but code libraries have the advantage that over time you learn a lot about one particular topic or subject.
Recently my interest has turned to nonparametric statistical techniques, and for the most surprising reason, that nonparametric estimators can be visualized. Something about a smooth effects curve flanked by 95% confidence intervals has stuck in my imagination for a few months now. I have been plotting starting a new code library, but cannot make up my mind as to what programming language it should be in.
Here are the choices I considered, and the relative pros and cons.
- Matlab/Octave: Matlab is proprietary and given that I will not be in academia in the near future, this could be a short-lived choice. But coming from a programming language that is an also-ran when assessed the number of users metric, the large reach is enticing. Plus, Matlab ages slowly, so current versions are good for at least a few years. Matlab comes with excellent optimization libraries, and it is my aim to be able to explore those carefully should I write the library in Matlab. Matlab has great graphics and this is important for nonparametric statistics. The main con of Matlab is that it is slow and I am not very sure how it scales for large data.
My idea is to write first a fully Matlab version of the toolbox, as libraries are called in Matlab, and then fold in MEX code as hopefully I get better at it.
The Octave implementation is nice and also has a MEX-like system. I think the problem with Octave is that the quality of the toolboxes drops off fairly quickly.
- Python: I have only recently begun using Python. Computer scientists love it, and it is open source. Some economists have recently begun to write econometrics code for Python — most notably John Stachurski at ANU who provides a fully featured advanced undergraduate textbook in econometrics with Python code examples. The code I have come across and attempted to write has turned out very neat, and this is important to me. The problem here is that there is already a project very much of the nature that I had in mind being written here.
Another point in favor of Python is that it has great IDEs, including hooks for Visual Studio which I have recently discovered.
- R: Now you’d think this would be my first choice to write a statistics package. But I don’t like R (yet). I find the syntax unreadable, and the way functions are scattered around the vast numbers of packages impossible to keep track of. R is also (very) slow, although with C++ code folded in apparently it can be made much faster. I don’t know.
I have an ongoing project where I am translating the empirical examples from Wooldridge’s book to R, but I think that R is best suited as a scripting language leveraging the work others in providing pre-package routines.
I have access to the Revolution Analytics repackaging RevoR, and I like the IDE, but I remain unconvinced. R does have a fantastic library for nonparametric econometrics, the np package.
It is worthwhile spending some time making sure that the language chosen suits the project, because while switching is possible, it becomes more unlikely over time making dynamically inefficient equilibria likely.
For a number of reasons, hosting a blog on wordpress.com is not a very satisfactory option for me. Not only does it allow no changes to the CSS, installation of packages is severely limited. I have decided to move to my old domain, that I let lapse last year.
I intend to have it up and running by later today, once I have figured out which hosting plan to buy. The market for web-hosts is crowded and the product is not very well differentiated across sellers.
The workflow for today is roughly going to be:
- Buy web-hosting and set up website and wordpress blog.
- Download and compile OCaml for 64-bit Windows. This has a significant number of steps and might take a while.
- Code up the econometrics “Hello World” – generate some data and compute the regression coefficients, the standard errors and the t-statistics and write them to the console and to a text file.
- Blog about it on the new blog.
One of the advantages of having two machines running the identical OS is that when I configure one of the machines using a procedure with many steps, in addition to recording it in my notebook, I get to review the procedure while repeating it on the other machine, and so remembering it better.