Building An Analytics Portal from Scratch, Part 1
This is the first of a multi-part series written by our lead data scientist, Eli Finkelshteyn, on how and why we built the Backplane analytics portal from scratch.
Before I joined the team, Backplane was using both Google Analytics and Mixpanel for front-end logging. I liked both of these (they were definitely nice, quick solutions for getting a basic idea of what’s happening on our site), but I saw a few problems:
- These solutions have no security.
Both apps use simple, easily emulated front-end calls to send data. A quick search of Stack Overflow is all you need to figure out how to spoof those calls.
Because of that, if someone wanted to send us (or anyone else using these tools) a bunch of bad data, there’s nothing really stopping them. Sure, that’s a pretty boring hack and the hacker will never really see the results, but it is still easily doable, and that’s scary.
- We want to keep all our logged data.
Initially, this wasn’t such a big concern, but now that we’re getting bigger and have the resources, warehousing our data instead of just computing our analytics from it is a big win.
Having this data means we can log everything now, then go back and ask questions later. If we think of a new metric, we don’t need to waste time collecting new data before we can analyze a trend. We can simply run a quick Pig script on the old data over EMR, and have aggregated numbers and a graph in minutes. This is in addition to being able to use our old logging data for machine-learning gold sets when we happen upon an awesome new idea for a recommendation system or a classifier.
I know I’m not the first to say it, but old logging data is gold. Don’t just throw it away.
- We want to be able to graph whatever we want however we want.
For basic analytics, Google Analytics and Mixpanel are great. If you’re at a start-up with just a handful of employees, these two tools can get you far enough. Plus, investing the time necessary to build out your own analytics tools before you’ve got a great product or users is a bad idea.
For Backplane, though, now that we have both of these, we’re starting to get smarter about what metrics we’re looking at. Many of the statistics that interest us, we can’t easily get out of Google Analytics or Mixpanel. If we want to know how many registered users came to a site at least 2 days out of the last week (to get a sense of how many regulars we have), we’d be out of luck without our internal tools. And that’s just the tip of the iceberg.
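To make the "regulars" metric concrete, here is a minimal Python sketch of how it could be computed from warehoused logs. It is an illustration, not our production code: the `count_regulars` function, the `(user_id, day)` event shape, and the sample users are all assumptions made up for this example.

```python
from collections import defaultdict
from datetime import date, timedelta


def count_regulars(events, today, window_days=7, min_days=2):
    """Count users seen on at least `min_days` distinct days
    in the `window_days`-day window ending on `today` (inclusive).

    `events` is an iterable of (user_id, day) pairs; this shape is
    hypothetical and stands in for whatever the real logs contain.
    """
    window_start = today - timedelta(days=window_days - 1)
    days_seen = defaultdict(set)
    for user_id, day in events:
        if window_start <= day <= today:
            # A set ignores repeat visits on the same day.
            days_seen[user_id].add(day)
    return sum(1 for days in days_seen.values() if len(days) >= min_days)


# Made-up sample data for illustration only.
events = [
    ("alice", date(2012, 9, 24)),
    ("alice", date(2012, 9, 26)),
    ("bob", date(2012, 9, 26)),
    ("bob", date(2012, 9, 26)),    # second visit on the same day
    ("carol", date(2012, 9, 10)),  # outside the 7-day window
    ("carol", date(2012, 9, 27)),
]

print(count_regulars(events, today=date(2012, 9, 27)))  # prints 1
```

Only "alice" counts as a regular here: "bob" visited twice but on a single day, and one of "carol"'s visits falls outside the window. The point is that once the raw events are kept, a question like this is a few lines of aggregation rather than a new tracking setup.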
- It’s not cost effective.
This surprised me. Backplane has over 700,000 registered users, along with more than a hundred thousand lurkers. With so many users, there are several metrics we’re interested in, and we’re sending data points for all of them.
As an example, we sent 34 million data points in September of 2012. Mixpanel’s pricing plan charges $2000/month for 20 million data points, so you can imagine that we were paying significantly more. Because we’ve grown even more since then, these bills were only going to keep growing.
Conversely, doing the same thing ourselves costs us about $600/month in equipment, about 1 month of a single developer’s time up front, and at most 1 day a month of developer maintenance time. This gives us a robust, secure setup that can handle a few million users at current typical usage. That’d be half the price even if we were sending only 20 million data points, and, like I said, we’re sending way more.
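The arithmetic behind that comparison can be sketched in a few lines of Python. This is a rough back-of-the-envelope model, not a quote: the only figures taken from above are the $2000 for 20 million data points, the 34 million data points, the $600 in equipment, and the 1 day of maintenance. The linear extrapolation of Mixpanel’s price past 20 million points and the `dev_day_rate` value are assumptions for illustration.

```python
def mixpanel_cost_linear(points, tier_points=20_000_000, tier_price=2000.0):
    """Estimate a monthly bill by scaling the 20M-point tier linearly.

    ASSUMPTION: real pricing above the tier may not be linear.
    """
    return tier_price * points / tier_points


def in_house_cost(dev_day_rate, equipment=600.0, maintenance_days=1.0):
    """Monthly in-house cost: equipment plus developer maintenance time.

    `dev_day_rate` is a hypothetical cost of one developer day.
    The one-time build cost (about a month of developer time) is not included.
    """
    return equipment + maintenance_days * dev_day_rate


print(mixpanel_cost_linear(34_000_000))   # prints 3400.0
print(in_house_cost(dev_day_rate=500.0))  # prints 1100.0 (500.0 is a made-up rate)
```

Plug in your own day rate and data volume to see where the break-even point sits for your team.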
The next part in this series on our analytics portal will cover how we set up logging and data warehousing, and why we did it the way we did. Stay tuned!