• 2013-02-11

    Building an Analytics Portal from Scratch, Part 2 

    This is the second of a multi-part series written by our lead data scientist, Eli Finkelshteyn, on how and why we built the Backplane analytics portal from scratch. Be sure to check out part 1 as well.


    Thanks to everyone who sent me positive feedback about my first blog post. I really appreciated all of it, and it made writing this next part a much better experience for me. — Eli

    Now that you know why we wanted to move the bulk of our analytics and logging off Google Analytics and Mixpanel and onto our own platform, here is what we did instead.

    Whatever solution we chose, it would need to jibe with the following list of priorities:

    1. All logging data must be saved, in a format that can absorb new data and schema changes as they come up.
    2. All data must be warehoused somewhere reliable and easily accessible.
    3. There must be flexible processing and analytics storage layers. We’re going to want to process and slice our data into new charts and visualizations in real time (or in batches, depending on requirements).
    4. The solution must be reasonably fast and horizontally scalable—we need to be able to handle any future expected load without dying. Success for us means that if the number of active users on our site jumps by more than 100x, the site won’t crash and we don’t have to spend all of our time putting out fires.
    5. It must be built in about a month of my time. When I was building this, I was the one data scientist Backplane had, and this was just one of my tasks. Start-up mode means no one can go off and spend months perfecting pet projects for fun. Done is more important than perfect.

    The system I built accomplished all of these things, and none of the boxes it’s on have so much as hiccupped in the 5 months since. That’s the best endorsement I can give.

    The Logging Layer

    For the logging layer, the data needed to be

    1. …stored in JSON (as opposed to something like comma-delimited, where the schema is external to the data). What I was logging would probably go through several changes, and the logs needed to stay backwards compatible with regard to processing.
    2. …periodically saved to S3 so I wouldn’t need to worry about losing it or accessing it. Luckily, we could do that pretty cheaply.
    3. …able to be parallelized, so that I can tackle fires by throwing in more boxes and have everything decentralized. One box dying shouldn’t break the system.
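
    To make the first requirement concrete, here is a minimal sketch of why self-describing JSON lines stay backwards compatible when the schema changes. The field names (event, user_id, ts, device) are made up for illustration and are not Backplane's actual log schema.

```python
import json

# Hypothetical log lines. The second one comes from a later app version
# that started logging a "device" field the first one does not have.
RAW_LINES = [
    '{"event": "like", "user_id": 42, "ts": 1360540800}',
    '{"event": "like", "user_id": 7, "ts": 1360544400, "device": "ios"}',
]


def parse_line(line):
    """Parse one JSON log line, filling a default for fields older logs lack."""
    record = json.loads(line)
    record.setdefault("device", "unknown")
    return record


records = [parse_line(line) for line in RAW_LINES]
for record in records:
    print(record["event"], record["user_id"], record["device"])
```

    Because every line carries its own field names, the same processing code can read old and new logs side by side. With a comma-delimited format, adding a column would mean tracking which schema version each file was written with.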

    Research

    I researched Scribe, Flume OG and NG, and Kafka.

    • Scribe: I worried about Scribe because it’s not a central part of Facebook’s business, isn’t extensively documented, and didn’t seem to have a very active user community. It’s also no longer an active project at Facebook, which would have been hugely scary if we found bugs or needed updates.
    • Flume: Flume was in the middle of transitioning from OG to NG, and the two did not sound compatible. NG wasn’t quite finished or being used by many people, and building something on OG knowing full well it would soon be unsupported made no sense.
    • Kafka: Kafka had a small user community and could very well go the way of Scribe and become unsupported by LinkedIn at any time. The documentation also seemed poor. As a sidebar, this guy who did decide to go with Kafka has made some good points on the topic.

    On top of all this, I was spending a huge amount of time hacking and googling just to try to get something basic working with any of these solutions. Forcing them to store data in JSON or to save to S3 sounded like a big pain, especially since the user communities were, again, not very responsive.

    FluentD

    That’s when I ran into FluentD. It’s a new competitor in this market, but it stores everything in JSON by default. With its plethora of plugins, I can write to it from most common languages, have it write to S3, and handle any number of other tasks. Because it is written in very clean Ruby code, I can easily modify it myself if I ever need to. On top of all of that, the authors actively answer questions in their user community.

    With all that going for it, I had a prototype set up and writing the data I needed in JSON to S3 in mere hours. I had been skeptical of using a product that was so new and untested, but it was capable of doing everything I wanted, and doing it quickly. That was refreshing after days of frustrating research trying to bend the other solutions to our needs.

    So I set up a handful of aggregator boxes and load-tested the hell out of them by sending traffic at about 500x what we were currently receiving. FluentD withstood it all, no problem. I was convinced.

    Here’s a sketch of our logging layer with FluentD: [image]

    The data is written from each of our app servers to a couple of aggregator boxes, which combine it into a few larger files. Those files are then uploaded to S3 on an hourly basis for safekeeping. At the same time, the aggregator boxes stream the data they receive to a processing layer for real-time processing.
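
    The hourly roll-up is easy to picture in code. The sketch below is not how FluentD does it internally, and the key layout (logs/YYYY/MM/DD/HH.json) is an assumption for illustration. It only shows the idea of grouping incoming records into one newline-delimited JSON payload per hour, each of which could then be uploaded as a single S3 object.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def hourly_key(ts, prefix="logs"):
    """Build a hypothetical S3-style object key for the hour containing ts."""
    hour = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{prefix}/{hour:%Y/%m/%d/%H}.json"


def group_by_hour(records):
    """Group records into one newline-delimited JSON payload per hour."""
    buckets = defaultdict(list)
    for record in records:
        line = json.dumps(record, sort_keys=True)
        buckets[hourly_key(record["ts"])].append(line)
    return {key: "\n".join(lines) for key, lines in buckets.items()}


# Made-up events with Unix timestamps in seconds (UTC).
events = [
    {"event": "like", "ts": 1360540800},  # 2013-02-11 00:00:00
    {"event": "view", "ts": 1360542600},  # 2013-02-11 00:30:00
    {"event": "like", "ts": 1360544400},  # 2013-02-11 01:00:00
]
files = group_by_hour(events)
for key in sorted(files):
    print(key, len(files[key].splitlines()), "records")
```

    Keying the files by hour means a batch job can later pull exactly the time range it needs instead of scanning everything.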

    Logging Security

    One of the concerns brought up in the last article was about logging security. We haven’t found a silver bullet here, but we have been able to make things a lot more secure and reliable than what we had with Google and Mixpanel. We do this by bucketing logs depending on how much we trust them.

    Log Bucket 1: Backend

    The most trusted bucket holds the logs generated from our backend. For example, when a user goes in and “likes” a post, a log is generated on our backend only when that action actually happened. Someone could create a bot that goes around liking a bunch of posts to spam our site, but our logs would still be accurate—those “likes” actually did happen (plus, protecting against bots is a completely separate problem).

    Compare this to using only Google or Mixpanel, where someone could just spoof the front end calls for “likes” to make it look like a post got liked a lot when in reality it didn’t. That’s a much bigger data validity problem because if that happens even once, it means we can’t trust any of our data anymore since we don’t know where or when such a breach might have happened before.

    Log Bucket 2: Logged-in Users

    The second most trusted bucket is front-end logs generated from logged-in users. These are bucketed separately from our backend logs in case problems do occur, so they don’t contaminate the backend logs.

    The idea is that on every front-end log call we receive, we check that the sender has a valid logged-in session, and we record the user ID and the IP address. We throw out outliers on both of these fronts, so if an abnormally high amount of activity comes from any single IP or any single logged-in user, it is automatically thrown out in processing. That means anyone who wants to feed us bad data here has to build a bot that uses a large number of spoofed IPs, creates a bunch of fake user accounts, and pays to crack the CAPTCHAs on all of them.
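
    As a rough illustration of the outlier step, the sketch below drops every event from an IP or user whose event count is far above the typical count. The cutoff rule (a multiple of the median) and the field names are assumptions for the example, not the checks actually running in production.

```python
from collections import Counter
from statistics import median


def drop_outliers(events, factor=10):
    """Drop events from any IP or user whose volume exceeds factor x the median.

    The median-based cutoff is an assumed rule chosen for illustration.
    """
    if not events:
        return []
    ip_counts = Counter(e["ip"] for e in events)
    user_counts = Counter(e["user_id"] for e in events)
    ip_cap = factor * median(ip_counts.values())
    user_cap = factor * median(user_counts.values())
    return [
        e for e in events
        if ip_counts[e["ip"]] <= ip_cap and user_counts[e["user_id"]] <= user_cap
    ]


# Five ordinary users with two events each, plus one bot sending 100 events.
normal = [{"user_id": u, "ip": f"10.0.0.{u}"} for u in range(1, 6) for _ in range(2)]
bot = [{"user_id": 99, "ip": "10.0.0.99"} for _ in range(100)]
kept = drop_outliers(normal + bot)
print(len(kept), "events kept out of", len(normal + bot))
```

    A median-based cap is used here because a single noisy source barely moves the median, whereas it would drag a mean-based threshold upward.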

    There are also a number of other checks we do on these, but those are more for security through obscurity. Even after going to all of the trouble mentioned above, a potential hacker still wouldn’t know whether their hack was successful, since they never see our data.

    Log Bucket 3: Everyone else

    The final bucket is front-end logging for users who aren’t logged in. This is a really small part of our logging, and we avoid using it whenever possible. Still, when it’s absolutely necessary, we run all of the checks above except for the user check. Data going here is also bucketed separately so it doesn’t contaminate any of the other logs.

    I should note that even our least secure layer of logging (which we try not to use) is still better than logging through Google Analytics and Mixpanel since we throw out the outliers and have full access to all of our own logs. Thus, if something suspicious happens, we can easily and immediately check it out.
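
    The three trust levels boil down to a simple routing decision. Here is a minimal sketch of that decision, with hypothetical names (Bucket, choose_bucket, the "backend" source label) that are not taken from our codebase.

```python
from enum import Enum


class Bucket(Enum):
    """Trust buckets, from most to least trusted."""
    BACKEND = "backend"
    LOGGED_IN = "frontend_logged_in"
    ANONYMOUS = "frontend_anonymous"


def choose_bucket(source, has_valid_session):
    """Pick the bucket for a log call based on its origin and session state."""
    if source == "backend":
        return Bucket.BACKEND
    if has_valid_session:
        return Bucket.LOGGED_IN
    return Bucket.ANONYMOUS


print(choose_bucket("backend", False).value)
print(choose_bucket("frontend", True).value)
print(choose_bucket("frontend", False).value)
```

    Keeping the buckets physically separate is what stops a problem in a less trusted bucket from contaminating the more trusted ones.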


    The next part in this series on our analytics portal will cover Processing and Visualization. Stay tuned!
