Various probabilistic hashing/sketching algorithms in Clojure (Bloom filters, min-hash, HyperLogLog, count-min). Useful for making compact and mergeable summaries of streaming data. Handy for distributed systems and even for ML tasks (specifically min-hash).
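To make "mergeable" concrete, here is a minimal min-hash sketch. It is written in Python for illustration only and is not the Clojure library's API; the names `minhash`, `merge`, and `similarity`, the `num_hashes` parameter, and the choice of BLAKE2b as the hash are all assumptions. The point it shows is that two signatures built on separate machines can be combined into the signature of the union without revisiting the raw data.

```python
import hashlib


def _hash(item, seed):
    """Hypothetical helper: a seeded 64-bit hash of an item's string form."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(items, num_hashes=128):
    """Signature of a non-empty collection: the minimum hash under each seed."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def merge(sig_a, sig_b):
    """Signature of the union: the element-wise minimum of two signatures."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def similarity(sig_a, sig_b):
    """Estimated Jaccard similarity: the fraction of matching positions."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Merging works because the minimum over a union is the minimum of the per-set minimums, so `merge(minhash(a), minhash(b))` equals `minhash(a | b)` for any two non-empty sets.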
I've been a co-founder and hacker at BigML since 2011. Before that I worked on recommender systems and applying Dynamic Bayesian Networks to assorted problems (mostly in the defense world). More recently at BigML I've spent my time implementing various ML techniques in a scalable fashion as part of our web service (trees, ensembles, g-means and fancy k-means, isolation forests, etc.). I also spend time working on visualizations and a few open source projects, which is what this page covers.
Open Source Projects
A few BigML open source projects where I've acted as the main contributor. They're all data-oriented, offering techniques for summarizing or sampling over large (or streaming) data sources.
A selection of my ever-changing side projects. These are the ones I actually finished... or at least came close enough to be presentable!
My wife and I have been dreaming of Portland. So I collected oodles of data to get a better grasp of which areas are well connected to downtown, grocers, and schools. The resulting app lets you set the filters to find the best-connected homes. Beware, the page takes a while to load and I've done almost no testing for mobile devices.
Grades each state and individual congressional district based on how badly the districts appear to be gerrymandered. Grades are calculated for both the 2009 (pre-census) and 2013 (post-census) district maps. The grades are found using the ratio of a district’s area to the area of its convex hull. See the fancy viz and the code used to generate it.
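The area-to-convex-hull ratio can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the project: it treats a district as a single simple polygon in planar coordinates (real boundaries would need a map projection and multi-polygon handling), and the function names are hypothetical. A ratio near 1 means a compact shape; lower values mean more of the hull is empty space.

```python
def polygon_area(points):
    """Absolute area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def hull_ratio(points):
    """Polygon area divided by convex hull area (0.0 for degenerate input)."""
    hull_area = polygon_area(convex_hull(points))
    return polygon_area(points) / hull_area if hull_area else 0.0
```

For example, a square scores exactly 1.0, while an L-shaped polygon scores lower because its hull covers the notch that the polygon itself leaves out.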
Demos of some of the visualizations I've created at BigML. These all use the fantastic D3 library. Many of them are now in production, but others are still being polished for prime time. See more on my bl.ocks.org page.
A scatterplot that supports both numeric and categorical features, clumps together nearby points, uses color as a third dimension, and has pretty transition animations to boot. The iris dataset is a classic example, but I think the abalone and autos examples are more interesting to explore.