Chicago HUG Presentation Oct 2011

Chicago Hadoop User Group presentation at Orbitz, October 2011

1. GENTLE STROLL DOWN THE ANALYTICS MEMORY LANE
   Abe Taha, VP Engineering, Karmasphere
   Oct 19th, 2011
2. What is this talk about
   • This talk is a story about building an analytics services team at Ning and the experiences and lessons learned
   • There is also a bit about how I'd do things differently
   • And like a good story, an ending
3. Caveat Lector
   • The story has no pictures or conversations
   • "And what is the use of a book," thought Alice, "without pictures or conversations?" (Alice's Adventures in Wonderland, Lewis Carroll)
4. Your storyteller
   • Mostly scalable distributed systems background
     • At Yahoo: Search and Social Search
     • At Google: app infrastructure
     • At Ning: Hadoop for analytics and system management services
     • At Ask: Dictionary/Reference properties
   • Now at Karmasphere building analytics applications on Hadoop
5. Prologue
   • The story begins at Ning
   • Starting the analytics and systems management teams
   • In 2008
   • When Hadoop was gaining popularity
   • v0.16 was out
6. A bit about Ning
   • Hot company at the time, co-founded by Andreessen
   • Allowed users to build websites that look like Facebook
   • Websites called networks
   • Networks had social features
     • Blogs
     • Photos
     • Videos
     • Chat
     • Social graph
   • Each network had a major topic/category
   • Most networks were free, a few for pay
   • Free networks monetized through contextual ads
   • The theory was that people produce good content that you can monetize
7. Raison d'être for the analytics team
   • Figure out what ads to display on the network
     • Look at user generated content (UGC)
       • Posts
       • Comments and discussions
       • Tags on photos and videos
     • Come up with categories for networks and ads
   • Model network trends and business metrics
   • Predict serving machine growth (poor man's EC2)
   • Model machine and application data (poor man's EC2)
     • Memory, disk, CPU, network
     • Application logs, counters, etc.
8. First: building the team
   • The data scientist title was not common then; second best: good engineers
     • Distributed systems engineers (3) for the infrastructure
     • Statistics and ML engineers (2) for modeling and trending
     • Data visualization engineers (1) for building dashboards to interact with the data
     • Systems management engineers (2) for building the machine monitoring systems
9. Second: figuring out where the data is
   • Typical company scenario
     • Data resides in log files
       • Machine or application logs
     • Stored locally
     • Purged after 30 days
10. Third: where to keep the data
   • Wanted to keep all the historical data
   • In a centralized place
   • Without paying too much money
   • Or using specialized hardware
   • Ruled out a data warehouse (DW)
   • Had experience with systems that looked like Hadoop (or Hadoop looked like them)
   • Team wanted to experiment with newer technology
   • -> Data in Hadoop
   • V1: a proof of concept (POC)
11. V1: getting data in
   • Minor changes to store all machine and application logs on an NFS drive
     • A couple of retired NetApp filers
   • Log files copied into HDFS using the Hadoop client (see the sketch below)
   • Data organized by source in a directory hierarchy
     • Grouped by date
   • No preprocessing
   • 3x replication
   • Some latency in moving the data
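A minimal sketch of the kind of copy job this describes, using the standard Hadoop Java client. The /logs/<source>/<date> layout, class name, and argument handling are illustrative assumptions, not the original Ning code.

// LogLoader: copy one day's local log files into a per-source, per-date HDFS hierarchy.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.time.LocalDate;

public class LogLoader {
  public static void main(String[] args) throws Exception {
    String source = args[0];                  // e.g. "webserver" or "appserver" (assumed naming)
    File localDir = new File(args[1]);        // e.g. /nfs/logs/webserver/2008-06-01
    LocalDate day = LocalDate.parse(args[2]); // e.g. 2008-06-01

    Configuration conf = new Configuration(); // picks up fs.defaultFS from core-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical layout: /logs/<source>/<yyyy>/<MM>/<dd>/<filename>
    Path dest = new Path(String.format("/logs/%s/%04d/%02d/%02d",
        source, day.getYear(), day.getMonthValue(), day.getDayOfMonth()));
    fs.mkdirs(dest);

    File[] files = localDir.listFiles();
    if (files == null) return;
    for (File f : files) {
      // copyFromLocalFile leaves the local copy in place; the 3x replication on the slide
      // is whatever dfs.replication is configured to on the cluster.
      fs.copyFromLocalFile(new Path(f.getAbsolutePath()), dest);
    }
    fs.close();
  }
}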
12. V1: now what
   • Custom Java map-reduce programs to process the data (a minimal example follows below)
   • Support libraries to parse different log file formats
   • Jobs did simple analytics
     • Averages
       • Network response times
       • User engagement
     • Trends per network
       • Active users
       • Pageviews
     • Most common/popular
       • Browsers, pages, queries
     • Indexing
     • Machine utilization
   • Simple scheduler to run jobs
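A minimal sketch of the kind of custom Java map-reduce job listed above: counting pageviews per network per day. It uses the current mapreduce API (the 2008-era code would have used the older mapred API), and the tab-separated log format (network, date, url, ...) is an assumption for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class PageviewCount {

  public static class PvMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length < 2) return;            // skip malformed lines
      outKey.set(fields[0] + "\t" + fields[1]); // network + date
      ctx.write(outKey, ONE);
    }
  }

  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) total += c.get();
      ctx.write(key, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pageview-count");
    job.setJarByClass(PageviewCount.class);
    job.setMapperClass(PvMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}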
13. V1: dashboarding
   • Results stored in flat files in HDFS
   • Grouped daily/weekly/monthly
   • Use gnuplot to build dashboards every hour
14. What did we learn from V1
   • The POC proved the viability of Hadoop
   • Latency of pulling files was an issue
   • Most of the metrics computations are of the same nature
   • People need flexibility in defining what is measured
   • Once you put data in front of people, they ask more questions
   • A POC shows which areas are a pain, and where to invest to fix them
15. V2: changing data ingestion
   • Use event records instead of log files
   • Pushed through HTTP (a rough sketch of the event shape and submission follows below)
   • Built using Thrift
   • Events have
     • Names
     • Timestamps
     • Host
     • Version
     • Payloads
   • Published catalog
     • All available events
     • Event parsers
   • Load ~50 million external page views (~10 events per page)
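A sketch of what an event record and its HTTP submission might look like. The real system defined events as Thrift structs and used Thrift serialization; here the struct is mirrored as a plain Java class and posted as JSON purely for illustration. The collector URL and field names are assumptions.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EventSender {

  // Mirrors the fields listed on the slide: name, timestamp, host, version, payload.
  static class Event {
    String name;      // e.g. "pageview"
    long timestampMs; // event time
    String host;      // emitting host
    int version;      // schema version from the published catalog
    String payload;   // opaque, event-specific data

    String toJson() {
      return String.format(
          "{\"name\":\"%s\",\"timestamp\":%d,\"host\":\"%s\",\"version\":%d,\"payload\":\"%s\"}",
          name, timestampMs, host, version, payload);
    }
  }

  static void send(Event e) throws Exception {
    // Hypothetical collector endpoint.
    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://collector.example.com:8080/events").openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    try (OutputStream out = conn.getOutputStream()) {
      out.write(e.toJson().getBytes(StandardCharsets.UTF_8));
    }
    if (conn.getResponseCode() != 200) {
      throw new RuntimeException("collector rejected event: " + conn.getResponseCode());
    }
  }
}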
16. V2: collectors
   • Receive events
   • Put them in a memory queue
   • Background processes store them to local disk (sketched below)
   • Check events for validity against the catalog
   • Separate into valid/invalid queues
   • Another process sucks the data into HDFS and organizes it in a directory hierarchy
     • Events
     • Grouped by date
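A rough sketch of the collector's queue-and-drain structure described above: events land in an in-memory queue, a background thread validates them against the catalog and appends them to per-day spool files on local disk (a separate process then moved those files into HDFS). Class names, spool paths, and the event wire format are assumptions.

import java.io.FileWriter;
import java.io.IOException;
import java.time.LocalDate;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class Collector {
  private final BlockingQueue<String> incoming = new LinkedBlockingQueue<>(100_000);
  private final Set<String> catalog; // names of all known event types

  public Collector(Set<String> catalog) {
    this.catalog = catalog;
    Thread drainer = new Thread(this::drainLoop, "collector-drainer");
    drainer.setDaemon(true);
    drainer.start();
  }

  // Called by the HTTP handler for each received event (serialized form).
  public void receive(String serializedEvent) throws InterruptedException {
    incoming.put(serializedEvent); // blocks if the queue is full (simple backpressure)
  }

  private void drainLoop() {
    while (true) {
      try {
        String event = incoming.take();
        String eventName = event.split("\t", 2)[0]; // assumed: event name is the first field
        String queue = catalog.contains(eventName) ? "valid" : "invalid";
        appendToSpool(queue, event);
      } catch (InterruptedException e) {
        return;
      } catch (IOException e) {
        e.printStackTrace(); // a real collector would retry or alert
      }
    }
  }

  private void appendToSpool(String queue, String event) throws IOException {
    String file = String.format("/var/spool/events/%s/%s.log", queue, LocalDate.now());
    try (FileWriter w = new FileWriter(file, true)) {
      w.write(event);
      w.write('\n');
    }
  }
}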
17. V2: computation abstraction
   • Common tasks
     • Projection
       • What fields am I interested in?
     • Filtering
       • What records am I interested in?
     • Aggregations
       • What do I want to do with the metrics?
   • Common readers and writers for data types
   • Captured in libraries that can be composed for complex analytics (see the sketch below)
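A small sketch of what the projection / filtering / aggregation abstraction might look like as composable Java interfaces. The names and shapes are illustrative; the actual Ning libraries mapped these operations onto map-reduce jobs.

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class Pipeline {

  // Projection: which fields of the raw event am I interested in.
  interface Projection extends Function<Map<String, String>, Map<String, String>> {}

  // Filter: which records am I interested in.
  interface Filter extends Predicate<Map<String, String>> {}

  // Aggregation: what do I want to do with the projected metric values.
  interface Aggregation extends Function<List<Map<String, String>>, Double> {}

  static double run(List<Map<String, String>> events,
                    Filter filter, Projection projection, Aggregation aggregation) {
    List<Map<String, String>> selected = events.stream()
        .filter(filter)
        .map(projection)
        .collect(Collectors.toList());
    return aggregation.apply(selected);
  }

  public static void main(String[] args) {
    // Example composition: average response time of "pageview" events.
    Filter pageviews = e -> "pageview".equals(e.get("name"));
    Projection responseTime = e -> Map.of("response_ms", e.get("response_ms"));
    Aggregation average = rows -> rows.stream()
        .mapToDouble(r -> Double.parseDouble(r.get("response_ms")))
        .average().orElse(0.0);
    // double avg = run(events, pageviews, responseTime, average);
  }
}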
18. V2: better dashboards
   • Metrics summarized in MySQL databases (an example of the underlying summary query follows below)
   • Interactive dashboards using Ruby/Sinatra
     • Select metrics
     • Time range
     • Aggregation method
   • Plot results using FusionCharts
     • OpenCharts was a close second, but no combined charts (histograms, line charts)
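The dashboards themselves were Ruby/Sinatra; this Java/JDBC snippet only illustrates the kind of query an interactive dashboard would issue against a hypothetical daily_metrics summary table in MySQL (the schema, column names, and connection details are assumed).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MetricQuery {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
            "jdbc:mysql://metricsdb.example.com/analytics", "dashboard", "secret");
         PreparedStatement stmt = conn.prepareStatement(
            // Metric, time range, and aggregation method are the three knobs the
            // dashboard exposed; here the aggregation is a simple daily SUM.
            "SELECT day, SUM(value) FROM daily_metrics " +
            "WHERE metric = ? AND day BETWEEN ? AND ? GROUP BY day ORDER BY day")) {
      stmt.setString(1, "pageviews");
      stmt.setString(2, "2008-06-01");
      stmt.setString(3, "2008-06-30");
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}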
19. What did we learn from V2
   • HDFS I/O is better than the local disk
     • No need for the process that saves locally and then to HDFS
   • People loved events
     • Led to event abuse
     • Each feature on the page had an associated event
     • Events were used for performance tuning: how much time did a feature take
     • Events were used for monitoring backend features: record errors with services
   • Large numbers of files cause problems for the namenode
     • Need to coalesce events to reduce the file count
   • With flexible event types and interactive dashboards, people have more questions
     • We couldn't keep up with developing custom metrics and charts
     • Needed a self-serve query mechanism
20. V3: ingestion
   • Minor modifications
     • Collectors now write to HDFS (see the sketch below)
     • Collectors accumulate events to reduce the file count
   • Self-serve UI for defining new events outside of the metrics team
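A sketch of the V3 collector change: instead of spooling to local disk, accumulate events in memory and roll a new HDFS file only when a size threshold is reached, keeping the number of files (and namenode load) down. The threshold, paths, and file naming are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;

public class HdfsEventWriter {
  private static final int FLUSH_THRESHOLD_CHARS = 64 * 1024 * 1024; // roughly one HDFS block

  private final FileSystem fs;
  private final StringBuilder buffer = new StringBuilder();

  public HdfsEventWriter(Configuration conf) throws IOException {
    this.fs = FileSystem.get(conf);
  }

  // Called for every validated event; writes a whole file at a time.
  public synchronized void append(String serializedEvent) throws IOException {
    buffer.append(serializedEvent).append('\n');
    if (buffer.length() >= FLUSH_THRESHOLD_CHARS) {
      flush();
    }
  }

  public synchronized void flush() throws IOException {
    if (buffer.length() == 0) return;
    // Hypothetical layout: /events/<date>/events-<timestamp>.log
    Path file = new Path(String.format("/events/%s/events-%d.log",
        LocalDate.now(), System.currentTimeMillis()));
    try (FSDataOutputStream out = fs.create(file)) {
      out.write(buffer.toString().getBytes(StandardCharsets.UTF_8));
    }
    buffer.setLength(0);
  }
}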
21. V3: computation
   • Need a higher level language for queries
     • JSON API exposing a search-like query syntax
     • {from: 'date', to: 'date', metric: 'x', computation}
     • Computations are encapsulated into libraries and exposed through JSON
     • Users can add metrics and computations and build frontends for the query language (a server-side sketch follows below)
   • Custom code for ML tasks
     • Cascading for algorithms
     • R for visualization
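A minimal sketch of how such a JSON query might be handled on the server side: parse the {from, to, metric, computation} document and dispatch to a registered computation. Jackson is used here for JSON parsing; the registry, field names, and placeholder computation are illustrative assumptions, not the original API.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class QueryService {

  // A "computation" takes the parsed query and returns a result (here a double).
  private final Map<String, Function<JsonNode, Double>> computations = new HashMap<>();
  private final ObjectMapper mapper = new ObjectMapper();

  public QueryService() {
    // Users could register new computations; "sum" is just an example.
    computations.put("sum", q -> runSum(q.get("metric").asText(),
                                        q.get("from").asText(), q.get("to").asText()));
  }

  public double handle(String jsonQuery) throws Exception {
    // e.g. {"from":"2008-06-01","to":"2008-06-30","metric":"pageviews","computation":"sum"}
    JsonNode query = mapper.readTree(jsonQuery);
    Function<JsonNode, Double> computation =
        computations.get(query.get("computation").asText());
    if (computation == null) {
      throw new IllegalArgumentException("unknown computation");
    }
    return computation.apply(query);
  }

  private double runSum(String metric, String from, String to) {
    // In the real system this fans out to precomputed data or map-reduce jobs.
    return 0.0; // placeholder
  }
}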
22. V3: dashboards
   • More intermediate data precomputed
   • Data stored in HBase
   • Dashboards go against HBase (a read sketch follows below)
   • Templates for users to build custom dashboards
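A sketch of how a dashboard might read a precomputed time series out of HBase, written against the current HBase client API (the deck-era code would have used the older HTable API). The table layout (row key = metric#yyyy-MM-dd, column family "d", qualifier "v") is an assumption for illustration.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DashboardReader {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("metrics"))) {

      // Scan one month of the "pageviews" series by row-key range.
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes("pageviews#2008-06-01"))
          .withStopRow(Bytes.toBytes("pageviews#2008-07-01"));

      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          String key = Bytes.toString(row.getRow());
          long value = Bytes.toLong(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("v")));
          System.out.println(key + "\t" + value); // one point per day for the chart
        }
      }
    }
  }
}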
23. V3: What did we learn
   • Self-serve is the way to go
   • Give people the infrastructure and the support libraries and they'll go to town
   • Some tasks still can't be done in a framework and need custom code
     • Machine learning, with analysis in R
   • ML is hard, even with experience
     • Data is not clean
     • Some content is very small
       • Comments on pictures and videos (workarounds for aggregation)
   • Even then you can build products around the results
     • People and network recommenders
     • Network categories for ads
24. How would we do it differently today
   • Open source obviates custom code
     • Scribe for data ingestion
     • Hive for self-serve analytics and business intelligence (a query example follows below)
     • Pig scripts subsume most of the Java code
     • Cascading for Java map-reduce
   • Dashboards still stay the same
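As a hedged illustration of the Hive-based self-serve path, this is a sketch of issuing a HiveQL query from Java over the present-day HiveServer2 JDBC interface (which postdates the 2011 talk). The table name, schema, partition column, and connection string are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
            "jdbc:hive2://hive.example.com:10000/default", "analyst", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
            // Top networks by pageviews for one day, over a hypothetical events table
            // partitioned by dt.
            "SELECT network, COUNT(*) AS pageviews " +
            "FROM events WHERE name = 'pageview' AND dt = '2011-10-01' " +
            "GROUP BY network ORDER BY pageviews DESC LIMIT 20")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}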
25. Epilogue
   • ML analysis showed most usage is spam
   • Shut down a lot of pr0n networks and video hosting networks in far east Asia
   • Team moved to different companies
     • Still in analytics at LinkedIn, Facebook, and Twitter
   • Company changed its business model to for-pay only and laid off half the staff 6 months later
   • Company was acquired recently
26. Takeaway
   • The problems and solutions are mostly the same everywhere
     • Getting data into Hadoop
     • How you compute over the data
     • Getting meaningful data out of Hadoop
   • Lots of software components exist to help you with these
   • It is about the balance of what you develop vs. what you acquire
  27. 27. Q&A27 © Karmasphere 2011 All rights reserved
28. The Leader in Big Data Intelligence on Hadoop
    www.karmasphere.com
