Monday, March 14, 2016

Simple MLP with Torch 7

I've been busy working on my capstone project lately. The goal of my project is to distinguish between good and bad data generated by the XENON100 dark matter experiment. Researchers in the collaboration have been using simple cut-based methods to get rid of noisy data, but as the experiment has aged these methods have become really inefficient. In other words, they've been throwing away a lot of data that could include the signal from a dark matter interaction event. Long story short, I'm using some machine learning techniques to categorize data as "good" or "bad". A method called boosted decision trees (BDT) has proven pretty effective at this task in the past, and I'm using an implementation bundled in a piece of software called TMVA. TMVA has been used for data analysis in a variety of high energy physics experiments. It's pretty effing cool because it includes many classification algorithms, including, but not limited to, SVMs (support vector machines), MLPs (multilayer perceptrons), BDTs and other decision tree methods, and a bunch of stuff I've never heard of.

Skip the next bit if you don't want to read my rant on ROOT. The awful part is that TMVA is built entirely on ROOT, and I don't want to learn ROOT. I'm a Python programmer with some experience in Java and JavaScript; I see all those asterisks, arrows, and character arrays and my eyes glaze over. I've tried to write some basic ROOT "macros," but they throw cryptic errors that don't make a damn bit of sense. You don't realize how spoiled you are with Python until you use something like ROOT. The worst shit is the interface between Python and ROOT. There is a pretty decent library called rootpy, but the documentation and source code are all but impenetrable (500-line files just to bind the ROOT histogram class to Python???). When rootpy works, it is pretty Pythonic, but it doesn't allow for a ton of flexibility. Say you run some stuff through TMVA and want to grab and manipulate some histograms stored in ROOT files. Forget about it. You just can't.

I should note that ROOT has some very powerful tools for data analysis, and it's pretty darn fast, as it's built right on top of C++. If part of my capstone were learning ROOT, I think I'd be a little less critical, but right now I see it as this obnoxious obstacle to me doing my analysis.

Anyways, I thought it would be cool to see if I could dump some of the data I've been creating into another machine learning framework and see if I can categorize my signals. In the past I've used Theano and found it really difficult: even the tutorial on the most basic neural net architecture (the MLP) stymied me, and even after implementing my own version of an MLP I still felt very uncomfortable with Theano. Lately, I've moved to Torch. Torch is awesome. Below is the code for setting up and training my MLP.
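
(The layer sizes and hyperparameters here are placeholder values, and the little random dataset is just there to make the snippet self-contained and runnable.)

require 'nn'

-- toy dataset in the format nn.StochasticGradient expects:
-- dataset[i] = {input, classIndex}, plus a size() method
local ninputs, nhidden, nclasses = 10, 25, 2
local dataset = {}
function dataset:size() return 200 end
for i = 1, dataset:size() do
  local input = torch.randn(ninputs)
  local class = input:sum() > 0 and 1 or 2  -- class labels are 1-based in Torch
  dataset[i] = {input, class}
end

-- the MLP itself: one hidden layer with a tanh nonlinearity, and a
-- log-softmax output to pair with the negative log-likelihood criterion
local mlp = nn.Sequential()
mlp:add(nn.Linear(ninputs, nhidden))
mlp:add(nn.Tanh())
mlp:add(nn.Linear(nhidden, nclasses))
mlp:add(nn.LogSoftMax())

local criterion = nn.ClassNLLCriterion()

-- the "black box" training line: built-in stochastic gradient descent
local trainer = nn.StochasticGradient(mlp, criterion)
trainer.learningRate = 0.01
trainer.maxIteration = 25
trainer:train(dataset)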


Okay, okay, I'll admit the line for setting up the training seems pretty black box. Implementing something similar to the built-in stochastic gradient descent is not that tricky, though; there's a sketch at the end of this post. Basically, you chop up your training dataset into chunks (minibatches), feed them through the net, calculate gradients, and update the weight matrices.

Torch is built on Lua, and Lua is pretty straightforward. It isn't as widely used as Python, so there isn't as much help online, but it's not terrible. Torch's documentation is on GitHub and, like matplotlib's, is contained in one big page. This is a bit obnoxious, but not the end of the world. The biggest issue with Torch at the end of the day is loading and creating datasets: the code that loads in my data (stored in CSV files) is about twice as long as the code to build the MLP. Once you have that, though, the neural net is almost trivial (at least compared to Theano). As I delve deeper into this stuff, I'm sure I'll find that Torch is every bit as complicated as Theano, but I'm pretty happy for the moment.
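
Here's that hand-rolled minibatch skeleton. (It assumes inputs is an N x D tensor of examples and targets is an N-element LongTensor of 1-based class labels, with mlp and criterion defined as above; the batch size, learning rate, and epoch count are placeholders.)

-- minibatch SGD by hand: shuffle, slice, forward, backward, update
local batchSize, learningRate = 32, 0.01
local N = inputs:size(1)

for epoch = 1, 10 do
  local shuffle = torch.randperm(N):long()  -- new example order each epoch
  for t = 1, N, batchSize do
    local idx = shuffle:narrow(1, t, math.min(batchSize, N - t + 1))
    local batchInputs = inputs:index(1, idx)    -- one minibatch of examples
    local batchTargets = targets:index(1, idx)  -- and its labels

    mlp:zeroGradParameters()                    -- clear old gradients
    local outputs = mlp:forward(batchInputs)    -- forward pass
    local loss = criterion:forward(outputs, batchTargets)  -- (print this to watch training)
    local gradOutputs = criterion:backward(outputs, batchTargets)
    mlp:backward(batchInputs, gradOutputs)      -- accumulate gradients
    mlp:updateParameters(learningRate)          -- w = w - learningRate * dL/dw
  end
end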