Not being a gamer, I was not aware that some gaming keyboards, in particular my recently acquired Cooler Master Quickfire Rapid I, have a Windows key lock that prevents exiting from gaming mode if the Windows key is pressed accidentally. I use the Windows key quite frequently on Ubuntu to emulate the Windows “Snap to Sides” feature; for example, Ctrl+Win+RightArrow snaps a window to the right of the screen. I was not able to figure out why it was not working with my new keyboard. It turns out the Windows key lock is turned on by default, and can be turned off using Fn+PrtSc.
I spent some time trying to figure out whether I could get Google Analytics to track the traffic on my WordPress.com account, but it seems that this feature is only available on the WordPress Business plan, or if I migrate the site to WordPress.org.
I also spent some time figuring out how to set up a new account and property within a Google Analytics account. And then there is the question of configuring useful views of the data within a property (Account > Property > Views). For now, I have created two accounts within my GA account, one for my main site and the other for all the blogs I have on services such as tumblr and this site on WordPress.com.
I googled and put together a lightly curated set of links that I want to read to collate a strategy on how to prepare for a data science interview.
- http://career.guru99.com/top-50-interview-questions-on-machine-learning/: This link appears to have random questions, and the answers lack any depth whatsoever, but it covers topics that, while unlikely to come up in interviews, would be good to know about superficially.
- http://robertheaton.com/2014/03/07/lessons-from-a-silicon-valley-job-search/: This is not about machine learning in particular, but I liked the detailed writeup on how to structure a job search.
Since I will be answering questions on programming using Python, here are some links on Python programming interview questions:
I must admit that I don’t really like the look of some of these links, as they seem too focused on syntax or on clever and compact ways of achieving trivial things, and not on the big picture of the language, the architecture of applications, or actually solving difficult problems. In addition, these don’t cover some of the core packages that a data scientist should know, such as NumPy, SciPy or scikit-learn, and to a lesser extent pandas and statsmodels.
For reference, here is my implementation of FizzBuzz.
    for x in range(1, 101):  # FizzBuzz counts from 1 to 100
        if x % 15 == 0:  # multiples of both 3 and 5 must be checked first
            print("FizzBuzz")
        elif x % 3 == 0:
            print("Fizz")
        elif x % 5 == 0:
            print("Buzz")
        else:
            print(x)
This is the vanilla implementation that can be found all over the internet, and the trick really is to handle divisibility by 15 first, rather than last as the question is usually worded. Everything else is standard syntax. At some point I implemented it using generators, but I can’t find that solution, and I don’t want to get distracted from pulling together links for my interview prep.
Over the next couple of days I intend to work through these links, and I will see if I can refine them down to a core strategy that I can use to prepare and review before an interview. I also intend to blog about some of the peripheral aspects of a job search, such as writing a good resume and cover letter, interview etiquette, and searching for relevant jobs.
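For what it is worth, a generator-based version can be sketched roughly as follows. This is a minimal sketch and not the lost solution mentioned above; the function name `fizzbuzz` and its `limit` parameter are illustrative assumptions.

```python
def fizzbuzz(limit=100):
    """Yield the FizzBuzz sequence for 1..limit, one item at a time.

    The name and the `limit` parameter are illustrative choices.
    """
    for x in range(1, limit + 1):
        if x % 15 == 0:  # divisible by both 3 and 5: check this case first
            yield "FizzBuzz"
        elif x % 3 == 0:
            yield "Fizz"
        elif x % 5 == 0:
            yield "Buzz"
        else:
            yield str(x)


if __name__ == "__main__":
    # Items are produced lazily, so nothing is computed until we iterate.
    for item in fizzbuzz():
        print(item)
```

The only real difference from the loop version is that the values are yielded instead of printed, so the caller decides what to do with them.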
Spark is the flavour of the season when it comes to distributed computing frameworks, and I have been caught up in the excitement. I have narrowed down the set of Spark resources to three, which I plan to use over the next 3 months to learn Spark:
- The first is Advanced Analytics with Spark, which is hot off the presses, but has already gotten very good reviews on Amazon. Not surprising given that Sean Owen is one of the authors.
- The second, of which I have already read the first few chapters, but intend to systematically re-read over the next month, is Learning Spark. I must admit, I am somewhat intrigued by how much Matei Zaharia has managed to achieve at (what I am guessing is) a relatively young age.
- Lastly, there is a brand new edX.org course on Spark that has just started. It is a little disappointing that the course uses Python 2.7 and Spark 1.3, instead of moving to the impending release of Spark 1.4, which also supports SparkR and Python 3.4. I think that the Spark guys are holding off on releasing the latest version of Spark till the Spark Summit later this month.
In any case, Spark is an important new technology for data analysis, and a significant improvement over the disk-only storage model of Hadoop MapReduce. That is not to say that Spark is the only in-memory distributed computing framework that can do ad hoc querying, machine learning, and graph processing on big data. There is also Apache Flink, which was recently promoted to a top-level project, but Spark definitely appears to have a significant head start. I look forward to learning more about Apache Spark.
It has been a long time since I blogged here, so here is a quick update:
- Coffee is still as important to me as it was, but I no longer forget to consume it as part of my daily routine.
- I still struggle with typing. I have tried several different keyboards over the past 3 years, including the Microsoft Ergonomic 4000 (of which I have gone through 3), the Cooler Master Rapid-i tenkeyless, which I got recently, and the Logitech K400r.
- I have been reminded of Julia sporadically over the years, as people experiment with it, and as the tooling and support for Julia have gotten better over time. The second JuliaCon is going to be held in Cambridge, Mass this month. I have not experimented with Julia since the last blog post mentioning Julia was written on this blog.
- I still struggle with the pros and cons of multiple computing devices, and over the last 3 years, smartphones have only added to that struggle. I now own 4 (!) laptops, 2 mobile phones, and a (defunct) desktop PC. 3 of the 4 laptops now run Ubuntu (14.04, 14.10 and 15.04 (!)), and the other one runs Windows 7. One of my phones runs Android Lollipop, and the other runs Windows 8.1 (looking forward to the Windows 10 update).
I constantly forget which machines I have updated and which ones contain a particular piece of software. I have myriad unwritten rules about which machines are to be used for what, but most of those are compromised in favor of just doing what is most convenient.
- I did buy the domain and set up a new blog there, but that blog never contained anything interesting at all, and I eventually shut it down. Before that, it was auto-renewed for a year; I vowed to use it for something productive, but didn’t.
- I never got anywhere with OCaml or F# or any of the other functional programming languages that I wanted to learn at that point. I don’t think I ever tried after writing that blog post.
- I did end up investing a fair amount of effort in learning how to create, and in maintaining, a couple of R and Python package projects. They are not terribly complex, but it was helpful to understand the process of package creation, documentation, and distribution. Even though RStudio makes packaging R code extremely simple, those who are looking to create R packages might benefit from Hadley Wickham’s excellent new book R Packages.
- I completely abandoned trying to work with Matlab or Octave once I discovered NumPy. I even thought about translating the best known example of the use of Matlab (Andrew Ng’s machine learning course on Coursera) into NumPy and SciPy, but have not gotten around to it yet.
- I would say that I am better at R than I used to be. I will leave it at that. I did manage to do a lot of work that integrates C++ code with R using Rcpp.
- I never took up the work on understanding reinforcement learning and dynamic programming, even though I have had problems that I could potentially solve using those methods.
I have some idea of dynamic programming problems, based on my graduate level macroeconomics courses. But what I am trying to figure out are the kinds of machine learning problems that DP is most effective in providing insights into, and the kinds of insights that DP can provide.
I aim to collect some simple examples over the next few weeks that demonstrate the things that you can get out of setting up problems as reinforcement learning or DP problems. Stay tuned.
The similarity of the Julia programming language to Matlab and its syntax makes it very easy to translate simple Matlab programs into Julia code. The following code simulates data and then estimates linear regression parameters and standard errors in Julia and Matlab. The superficial similarity of the code is remarkable.
First, here is the Matlab code:

    clc; clear;
    vY = randn(100, 1);             % outcome variable
    mX = randn(100, 4);             % design matrix
    iN = size(vY, 1);               % sample size
    vBeta = mX\vY;                  % estimated coefficients (least squares)
    vE = vY - mX*vBeta;             % residuals
    dSigmaSq = vE'*vE/iN;           % residual variance
    mV = dSigmaSq.*(inv(mX'*mX));   % covariance matrix
    vStdErr = sqrt(diag(mV));       % std. err.
    vT = vBeta./vStdErr;            % t-statistics
    [vBeta, vStdErr, vT]

and here is the Julia code:

    using LinearAlgebra                  # provides diag on Julia 0.7 and later
    vY = randn(100, 1);                  # outcome variable
    mX = randn(100, 4);                  # design matrix
    iN = size(vY, 1);                    # sample size
    vBeta = mX\vY;                       # estimated coefficients (least squares)
    vE = vY - mX*vBeta;                  # residuals
    dSigmaSq = vE'*vE/iN;                # residual variance (a 1x1 matrix)
    mV = dSigmaSq[1,1].*(inv(mX'*mX));   # covariance matrix; Matlab uses dSigmaSq.*
    vStdErr = sqrt.(diag(mV))            # std. err.
    vT = vBeta./vStdErr                  # t-statistics
    println([vBeta'; vStdErr'; vT']')
Note that Matlab knows how to print matrices without a call to the println function. The main difference here is that Julia does not know that a 1×1 matrix is a scalar and issues a matrix multiplication conformability error, whereas Matlab simply switches to elementwise multiplication, which is the mathematically justifiable default.
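For comparison, the same kind of simulation can be sketched in Python with NumPy, which treats the 1×1 case the way Matlab does: a (1, 1) array broadcasts against a larger matrix as if it were a scalar. This is a minimal sketch using the textbook least-squares formulas; the function name `simulate_ols`, its parameters, and the fixed seed are illustrative assumptions.

```python
import numpy as np


def simulate_ols(n_obs=100, n_reg=4, seed=0):
    """Simulate data; return coefficients, std. errors and t-statistics.

    The name, parameters and seed are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    vY = rng.standard_normal((n_obs, 1))            # outcome variable
    mX = rng.standard_normal((n_obs, n_reg))        # design matrix
    iN = vY.shape[0]                                # sample size
    vBeta = np.linalg.lstsq(mX, vY, rcond=None)[0]  # estimated coefficients
    vE = vY - mX @ vBeta                            # residuals
    dSigmaSq = vE.T @ vE / iN                       # residual variance, shape (1, 1)
    mV = dSigmaSq * np.linalg.inv(mX.T @ mX)        # (1, 1) broadcasts like a scalar
    vStdErr = np.sqrt(np.diag(mV)).reshape(-1, 1)   # std. err. as a column
    vT = vBeta / vStdErr                            # t-statistics
    return np.hstack([vBeta, vStdErr, vT])


if __name__ == "__main__":
    print(simulate_ols())
```

The result has one row per regressor and three columns, matching the layout printed by the Matlab and Julia versions.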