Chicago HUG Presentation Oct 2011

Chicago Hadoop User Group presentation at Orbitz, October 2011

1. GENTLE STROLL DOWN THE ANALYTICS MEMORY LANE
   Abe Taha, VP Engineering, Karmasphere
   Oct 19th, 2011
2. What is this talk about
   • This talk is a story about building an analytics services team at Ning and the experiences and lessons learned
   • There is also a bit about how I'd do things differently
   • And like a good story, an ending
3. Caveat Lector
   • The story has no pictures or conversations
   • "And what is the use of a book," thought Alice, "without pictures or conversations?" (Alice's Adventures in Wonderland, Lewis Carroll)
4. Your storyteller
   • Mostly scalable distributed systems background
     • At Yahoo: Search and Social Search
     • At Google: app infrastructure
     • At Ning: Hadoop for analytics and system management services
     • At Ask: Dictionary/Reference properties
   • Now at Karmasphere building analytics applications on Hadoop
5. Prologue
   • The story begins at Ning
   • Starting the analytics and systems management teams
   • In 2008
   • When Hadoop was gaining popularity
   • v0.16 was out
6. A bit about Ning
   • Hot company at the time, co-founded by Andreessen
   • Allowed users to build websites that look like Facebook
   • Websites called networks
   • Networks had social features
     • Blogs
     • Photos
     • Videos
     • Chat
     • Social graph
   • Each network had a major topic/category
   • Most networks were free, a few for pay
   • Free networks monetized through contextual ads
   • The theory was that people produce good content that you can monetize
7. Raison d'être for the analytics team
   • Figure out what ads to display on the network
     • Look at user generated content (UGC)
       • Posts
       • Comments and discussions
       • Tags on photos and videos
     • Come up with categories for networks and ads
   • Model network trends and business metrics
   • Predict serving machine growth (poor man's EC2)
   • Model machine and application data (poor man's EC2)
     • Memory, disk, CPU, network
     • Application logs, counters, etc.
8. First: building the team
   • The data scientist title was not common then; second best: good engineers
     • Distributed systems engineers (3) for the infrastructure
     • Statistics and ML engineers (2) for modeling and trending
     • Data visualization engineers (1) for building dashboards to interact with the data
     • Systems management engineers (2) for building the machine monitoring systems
9. Second: figuring out where the data is
   • Typical company scenario
     • Data resides in log files
       • Machine or application logs
     • Stored locally
     • Purged after 30 days
10. Third: where to keep the data
   • Wanted to keep all the historical data
   • In a centralized place
   • Without paying too much money
   • Or using specialized hardware
   • Ruled out a data warehouse (DW)
   • Had experience with systems that looked like Hadoop (or Hadoop looked like them)
   • Team wanted to experiment with newer technology
   • -> Data in Hadoop
   • V1: a proof of concept (POC)
11. V1: getting data in
   • Minor changes to store all machine and application logs on an NFS drive
     • A couple of retired NetApp filers
   • Log files copied into HDFS using the Hadoop client (see the sketch below)
   • Data organized by source in a directory hierarchy
     • Grouped by date
   • No preprocessing
   • 3x replication
   • Some latency in moving the data
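A minimal sketch of the kind of copy job this describes, using the standard Hadoop Java client. The /logs/<source>/<date> layout, class name, and argument handling are illustrative assumptions, not the original Ning code.

// LogLoader: copy one day's local log files into a per-source, per-date HDFS hierarchy.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.time.LocalDate;

public class LogLoader {
  public static void main(String[] args) throws Exception {
    String source = args[0];                  // e.g. "webserver" or "appserver" (assumed naming)
    File localDir = new File(args[1]);        // e.g. /nfs/logs/webserver/2008-06-01
    LocalDate day = LocalDate.parse(args[2]); // e.g. 2008-06-01

    Configuration conf = new Configuration(); // picks up fs.defaultFS from core-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical layout: /logs/<source>/<yyyy>/<MM>/<dd>/<filename>
    Path dest = new Path(String.format("/logs/%s/%04d/%02d/%02d",
        source, day.getYear(), day.getMonthValue(), day.getDayOfMonth()));
    fs.mkdirs(dest);

    File[] files = localDir.listFiles();
    if (files == null) return;
    for (File f : files) {
      // copyFromLocalFile leaves the local copy in place; the 3x replication on the slide
      // is whatever dfs.replication is configured to on the cluster.
      fs.copyFromLocalFile(new Path(f.getAbsolutePath()), dest);
    }
    fs.close();
  }
}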
12. V1: now what
   • Custom Java map-reduce programs to process the data (a minimal example follows below)
   • Support libraries to parse different log file formats
   • Jobs did simple analytics
     • Averages
       • Network response times
       • User engagement
     • Trends per network
       • Active users
       • Pageviews
     • Most common/popular
       • Browsers, pages, queries
     • Indexing
     • Machine utilization
   • Simple scheduler to run jobs
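A minimal sketch of the kind of custom Java map-reduce job listed above: counting pageviews per network per day. It uses the current mapreduce API (the 2008-era code would have used the older mapred API), and the tab-separated log format (network, date, url, ...) is an assumption for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class PageviewCount {

  public static class PvMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length < 2) return;            // skip malformed lines
      outKey.set(fields[0] + "\t" + fields[1]); // network + date
      ctx.write(outKey, ONE);
    }
  }

  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) total += c.get();
      ctx.write(key, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pageview-count");
    job.setJarByClass(PageviewCount.class);
    job.setMapperClass(PvMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}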
13. V1: dashboarding
   • Results stored in flat files in HDFS
   • Grouped daily/weekly/monthly
   • Use gnuplot to build dashboards every hour
14. What did we learn from V1
   • The POC proved the viability of Hadoop
   • Latency of pulling files was an issue
   • Most of the metrics computations are of the same nature
   • People need flexibility in defining what is measured
   • Once you put data in front of people, they ask more questions
   • A POC shows which areas are a pain, and where to invest to fix them
15. V2: changing data ingestion
   • Use event records instead of log files
   • Pushed through HTTP (a rough sketch of the event shape and submission follows below)
   • Built using Thrift
   • Events have
     • Names
     • Timestamps
     • Host
     • Version
     • Payloads
   • Published catalog
     • All available events
     • Event parsers
   • Load ~50 million external page views (~10 events per page)
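A sketch of what an event record and its HTTP submission might look like. The real system defined events as Thrift structs and used Thrift serialization; here the struct is mirrored as a plain Java class and posted as JSON purely for illustration. The collector URL and field names are assumptions.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EventSender {

  // Mirrors the fields listed on the slide: name, timestamp, host, version, payload.
  static class Event {
    String name;      // e.g. "pageview"
    long timestampMs; // event time
    String host;      // emitting host
    int version;      // schema version from the published catalog
    String payload;   // opaque, event-specific data

    String toJson() {
      return String.format(
          "{\"name\":\"%s\",\"timestamp\":%d,\"host\":\"%s\",\"version\":%d,\"payload\":\"%s\"}",
          name, timestampMs, host, version, payload);
    }
  }

  static void send(Event e) throws Exception {
    // Hypothetical collector endpoint.
    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://collector.example.com:8080/events").openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    try (OutputStream out = conn.getOutputStream()) {
      out.write(e.toJson().getBytes(StandardCharsets.UTF_8));
    }
    if (conn.getResponseCode() != 200) {
      throw new RuntimeException("collector rejected event: " + conn.getResponseCode());
    }
  }
}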
16. V2: collectors
   • Receive events
   • Put them in a memory queue
   • Background processes store them to local disk (sketched below)
   • Check events for validity against the catalog
   • Separate into valid/invalid queues
   • Another process sucks the data into HDFS and organizes it in a directory hierarchy
     • Events
     • Grouped by date
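A rough sketch of the collector's queue-and-drain structure described above: events land in an in-memory queue, a background thread validates them against the catalog and appends them to per-day spool files on local disk (a separate process then moved those files into HDFS). Class names, spool paths, and the event wire format are assumptions.

import java.io.FileWriter;
import java.io.IOException;
import java.time.LocalDate;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class Collector {
  private final BlockingQueue<String> incoming = new LinkedBlockingQueue<>(100_000);
  private final Set<String> catalog; // names of all known event types

  public Collector(Set<String> catalog) {
    this.catalog = catalog;
    Thread drainer = new Thread(this::drainLoop, "collector-drainer");
    drainer.setDaemon(true);
    drainer.start();
  }

  // Called by the HTTP handler for each received event (serialized form).
  public void receive(String serializedEvent) throws InterruptedException {
    incoming.put(serializedEvent); // blocks if the queue is full (simple backpressure)
  }

  private void drainLoop() {
    while (true) {
      try {
        String event = incoming.take();
        String eventName = event.split("\t", 2)[0]; // assumed: event name is the first field
        String queue = catalog.contains(eventName) ? "valid" : "invalid";
        appendToSpool(queue, event);
      } catch (InterruptedException e) {
        return;
      } catch (IOException e) {
        e.printStackTrace(); // a real collector would retry or alert
      }
    }
  }

  private void appendToSpool(String queue, String event) throws IOException {
    String file = String.format("/var/spool/events/%s/%s.log", queue, LocalDate.now());
    try (FileWriter w = new FileWriter(file, true)) {
      w.write(event);
      w.write('\n');
    }
  }
}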
17. V2: computation abstraction
   • Common tasks
     • Projection
       • What fields am I interested in?
     • Filtering
       • What records am I interested in?
     • Aggregations
       • What do I want to do with the metrics?
   • Common readers and writers for data types
   • Captured in libraries that can be composed for complex analytics (see the sketch below)
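A small sketch of what the projection / filtering / aggregation abstraction might look like as composable Java interfaces. The names and shapes are illustrative; the actual Ning libraries mapped these operations onto map-reduce jobs.

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class Pipeline {

  // Projection: which fields of the raw event am I interested in.
  interface Projection extends Function<Map<String, String>, Map<String, String>> {}

  // Filter: which records am I interested in.
  interface Filter extends Predicate<Map<String, String>> {}

  // Aggregation: what do I want to do with the projected metric values.
  interface Aggregation extends Function<List<Map<String, String>>, Double> {}

  static double run(List<Map<String, String>> events,
                    Filter filter, Projection projection, Aggregation aggregation) {
    List<Map<String, String>> selected = events.stream()
        .filter(filter)
        .map(projection)
        .collect(Collectors.toList());
    return aggregation.apply(selected);
  }

  public static void main(String[] args) {
    // Example composition: average response time of "pageview" events.
    Filter pageviews = e -> "pageview".equals(e.get("name"));
    Projection responseTime = e -> Map.of("response_ms", e.get("response_ms"));
    Aggregation average = rows -> rows.stream()
        .mapToDouble(r -> Double.parseDouble(r.get("response_ms")))
        .average().orElse(0.0);
    // double avg = run(events, pageviews, responseTime, average);
  }
}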
18. V2: better dashboards
   • Metrics summarized in MySQL databases (an example of the underlying summary query follows below)
   • Interactive dashboards using Ruby/Sinatra
     • Select metrics
     • Time range
     • Aggregation method
   • Plot results using FusionCharts
     • OpenCharts was a close second, but no combined charts (histograms, line charts)
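The dashboards themselves were Ruby/Sinatra; this Java/JDBC snippet only illustrates the kind of query an interactive dashboard would issue against a hypothetical daily_metrics summary table in MySQL (the schema, column names, and connection details are assumed).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MetricQuery {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
            "jdbc:mysql://metricsdb.example.com/analytics", "dashboard", "secret");
         PreparedStatement stmt = conn.prepareStatement(
            // Metric, time range, and aggregation method are the three knobs the
            // dashboard exposed; here the aggregation is a simple daily SUM.
            "SELECT day, SUM(value) FROM daily_metrics " +
            "WHERE metric = ? AND day BETWEEN ? AND ? GROUP BY day ORDER BY day")) {
      stmt.setString(1, "pageviews");
      stmt.setString(2, "2008-06-01");
      stmt.setString(3, "2008-06-30");
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}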
19. What did we learn from V2
   • HDFS I/O is better than the local disk
     • No need for the process that saves locally and then to HDFS
   • People loved events
     • Led to event abuse
     • Each feature on the page had an associated event
     • Events were used for performance tuning: how much time did a feature take
     • Events were used for monitoring backend features: record errors with services
   • Large numbers of files cause problems for the namenode
     • Need to coalesce events to reduce the file count
   • With flexible event types and interactive dashboards, people have more questions
     • We couldn't keep up with developing custom metrics and charts
     • Needed a self-serve query mechanism
20. V3: ingestion
   • Minor modifications
     • Collectors now write to HDFS (see the sketch below)
     • Collectors accumulate events to reduce the file count
   • Self-serve UI for defining new events outside of the metrics team
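A sketch of the V3 collector change: instead of spooling to local disk, accumulate events in memory and roll a new HDFS file only when a size threshold is reached, keeping the number of files (and namenode load) down. The threshold, paths, and file naming are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;

public class HdfsEventWriter {
  private static final int FLUSH_THRESHOLD_CHARS = 64 * 1024 * 1024; // roughly one HDFS block

  private final FileSystem fs;
  private final StringBuilder buffer = new StringBuilder();

  public HdfsEventWriter(Configuration conf) throws IOException {
    this.fs = FileSystem.get(conf);
  }

  // Called for every validated event; writes a whole file at a time.
  public synchronized void append(String serializedEvent) throws IOException {
    buffer.append(serializedEvent).append('\n');
    if (buffer.length() >= FLUSH_THRESHOLD_CHARS) {
      flush();
    }
  }

  public synchronized void flush() throws IOException {
    if (buffer.length() == 0) return;
    // Hypothetical layout: /events/<date>/events-<timestamp>.log
    Path file = new Path(String.format("/events/%s/events-%d.log",
        LocalDate.now(), System.currentTimeMillis()));
    try (FSDataOutputStream out = fs.create(file)) {
      out.write(buffer.toString().getBytes(StandardCharsets.UTF_8));
    }
    buffer.setLength(0);
  }
}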
21. V3: computation
   • Need a higher level language for queries
     • JSON API exposing a search-like query syntax
     • {from: 'date', to: 'date', metric: 'x', computation}
     • Computations are encapsulated into libraries and exposed through JSON
     • Users can add metrics and computations and build frontends for the query language (a server-side sketch follows below)
   • Custom code for ML tasks
     • Cascading for algorithms
     • R for visualization
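A minimal sketch of how such a JSON query might be handled on the server side: parse the {from, to, metric, computation} document and dispatch to a registered computation. Jackson is used here for JSON parsing; the registry, field names, and placeholder computation are illustrative assumptions, not the original API.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class QueryService {

  // A "computation" takes the parsed query and returns a result (here a double).
  private final Map<String, Function<JsonNode, Double>> computations = new HashMap<>();
  private final ObjectMapper mapper = new ObjectMapper();

  public QueryService() {
    // Users could register new computations; "sum" is just an example.
    computations.put("sum", q -> runSum(q.get("metric").asText(),
                                        q.get("from").asText(), q.get("to").asText()));
  }

  public double handle(String jsonQuery) throws Exception {
    // e.g. {"from":"2008-06-01","to":"2008-06-30","metric":"pageviews","computation":"sum"}
    JsonNode query = mapper.readTree(jsonQuery);
    Function<JsonNode, Double> computation =
        computations.get(query.get("computation").asText());
    if (computation == null) {
      throw new IllegalArgumentException("unknown computation");
    }
    return computation.apply(query);
  }

  private double runSum(String metric, String from, String to) {
    // In the real system this fans out to precomputed data or map-reduce jobs.
    return 0.0; // placeholder
  }
}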
22. V3: dashboards
   • More intermediate data precomputed
   • Data stored in HBase
   • Dashboards go against HBase (a read sketch follows below)
   • Templates for users to build custom dashboards
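A sketch of how a dashboard might read a precomputed time series out of HBase, written against the current HBase client API (the deck-era code would have used the older HTable API). The table layout (row key = metric#yyyy-MM-dd, column family "d", qualifier "v") is an assumption for illustration.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DashboardReader {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("metrics"))) {

      // Scan one month of the "pageviews" series by row-key range.
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes("pageviews#2008-06-01"))
          .withStopRow(Bytes.toBytes("pageviews#2008-07-01"));

      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          String key = Bytes.toString(row.getRow());
          long value = Bytes.toLong(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("v")));
          System.out.println(key + "\t" + value); // one point per day for the chart
        }
      }
    }
  }
}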
23. V3: What did we learn
   • Self-serve is the way to go
   • Give people the infrastructure and the support libraries and they'll go to town
   • Some tasks still can't be done in a framework and need custom code
     • Machine learning, with analysis in R
   • ML is hard, even with experience
     • Data is not clean
     • Some content is very small
       • Comments on pictures and videos (workarounds for aggregation)
   • Even then you can build products around the results
     • People and network recommenders
     • Network categories for ads
24. How would we do it differently today
   • Open source obviates custom code
     • Scribe for data ingestion
     • Hive for self-serve analytics and business intelligence (a query example follows below)
     • Pig scripts subsume most of the Java code
     • Cascading for Java map-reduce
   • Dashboards still stay the same
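As a hedged illustration of the Hive-based self-serve path, this is a sketch of issuing a HiveQL query from Java over the present-day HiveServer2 JDBC interface (which postdates the 2011 talk). The table name, schema, partition column, and connection string are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
            "jdbc:hive2://hive.example.com:10000/default", "analyst", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
            // Top networks by pageviews for one day, over a hypothetical events table
            // partitioned by dt.
            "SELECT network, COUNT(*) AS pageviews " +
            "FROM events WHERE name = 'pageview' AND dt = '2011-10-01' " +
            "GROUP BY network ORDER BY pageviews DESC LIMIT 20")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}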
25. Epilogue
   • ML analysis showed most usage is spam
   • Shut down a lot of pr0n networks and video hosting networks in far east Asia
   • Team moved to different companies
     • Still in analytics at LinkedIn, Facebook, and Twitter
   • Company changed its business model to for-pay only and laid off half the staff 6 months later
   • Company was acquired recently
26. Takeaway
   • The problems and solutions are mostly the same everywhere
     • Getting data into Hadoop
     • How you compute over the data
     • Getting meaningful data out of Hadoop
   • Lots of software components exist to help you with these
   • It is about the balance of what you develop vs. what you acquire
  27. 27. Q&A27 © Karmasphere 2011 All rights reserved
28. The Leader in Big Data Intelligence on Hadoop
    www.karmasphere.com
