
Chicago HUG Presentation Oct 2011


Chicago Hadoop User Group presentation at Orbitz, October 2011

Published in: Technology, Education


  1. GENTLE STROLL DOWN THE ANALYTICS MEMORY LANE • Abe Taha, VP Engineering, Karmasphere • Oct 19th, 2011
  2. What is this talk about? • This talk is the story of building an analytics services team at Ning, and the experiences and lessons learned • There is also a bit about how I’d do things differently • And, like a good story, an ending
  3. Caveat Lector • The story has no pictures or conversations • “And what is the use of a book,” thought Alice, “without pictures or conversations?” (Alice’s Adventures in Wonderland, Lewis Carroll)
  4. Your storyteller • Mostly scalable distributed systems background • At Yahoo: Search and Social Search • At Google: app infrastructure • At Ning: Hadoop for analytics and system management services • At Ask: Dictionary/Reference properties • Now at Karmasphere, building analytics applications on Hadoop
  5. Prologue • The story begins at Ning • Starting the analytics and systems management teams • In 2008 • When Hadoop was gaining popularity • v0.16 was out
  6. A bit about Ning • A hot company at the time, co-founded by Andreessen • Allowed users to build websites that looked like Facebook • Websites were called networks • Networks had social features • Blogs • Photos • Videos • Chat • Social graph • Each network had a major topic/category • Most networks were free, a few were for pay • Free networks were monetized through contextual ads • The theory was that people would produce good content that you could monetize
  7. Raison d’être for the analytics team • Figure out what ads to display on the network • Look at user-generated content (UGC) • Posts • Comments and discussions • Tags on photos and videos • Come up with categories for networks and ads • Model network trends and business metrics • Predict serving machine growth (poor man’s EC2) • Model machine and application data (poor man’s EC2) • Memory, disk, CPU, network • Application logs, counters, etc.
  8. First: building the team • The “data scientist” title wasn’t common then, so the second-best option was engineers • Distributed systems engineers (3) for the infrastructure • Statistics and ML engineers (2) for modeling and trending • Data visualization engineers (1) for building dashboards to interact with the data • Systems management engineers (2) for building the machine monitoring systems
  9. Second: figuring out where the data is • Typical company scenario • Data resides in log files • Machine or application logs • Stored locally • Purged after 30 days
  10. Third: where to keep the data • Wanted to keep all the historical data • In a centralized place • Without paying too much money • Or using specialized hardware • Ruled out a data warehouse (DW) • Had experience with systems that looked like Hadoop (or that Hadoop looked like) • Team wanted to experiment with newer technology • -> Data in Hadoop • V1: a proof of concept (POC)
  11. V1: getting data in • Minor changes to store all machine and application logs on an NFS drive • A couple of retired NetApp filers • Log files copied into HDFS using the Hadoop client • Data organized by source in a directory hierarchy • Grouped by date • No preprocessing • 3x replication • Some latency in moving the data
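A rough sketch of this ingestion step, using the Hadoop FileSystem client API to copy a directory of local log files into a date-partitioned HDFS hierarchy. The layout (/data/&lt;source&gt;/yyyy/MM/dd/), the class name, and the per-source invocation are assumptions for illustration, not the original Ning tooling.

```java
import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical log copier: local/NFS logs -> /data/<source>/yyyy/MM/dd/ in HDFS. */
public class LogCopier {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 3);            // 3x replication, as on the slide
    FileSystem hdfs = FileSystem.get(conf);

    String source = args[0];                      // e.g. "apache-access" (illustrative)
    File localDir = new File(args[1]);            // directory on the NFS drive
    String day = new SimpleDateFormat("yyyy/MM/dd").format(new Date());
    Path target = new Path("/data/" + source + "/" + day + "/");
    hdfs.mkdirs(target);

    File[] logs = localDir.listFiles();
    if (logs == null) return;                     // nothing to copy
    for (File log : logs) {
      // Leaves the NFS copy in place; purging the 30-day-old originals is a separate job.
      hdfs.copyFromLocalFile(new Path(log.getAbsolutePath()), target);
    }
  }
}
```

Run periodically per log source; the latency mentioned on the slide comes from exactly this pull-based cadence.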
  12. V1: now what? • Custom Java map-reduce programs to process the data • Support libraries to parse different log file formats • Jobs did simple analytics • Averages • Network response times • User engagement • Trends per network • Active users • Pageviews • Most common/popular • Browsers, pages, queries • Indexing • Machine utilization • Simple scheduler to run jobs
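A minimal sketch of the kind of custom Java MapReduce job the slide describes, here counting page views per network. The tab-separated input layout, class names, and use of the newer org.apache.hadoop.mapreduce API are assumptions; the real jobs relied on in-house log-parsing libraries.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewsPerNetwork {

  /** Emits (networkId, 1) per log line; assumes the network id is the first tab-separated field. */
  public static class ViewMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text network = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        network.set(fields[0]);
        ctx.write(network, ONE);
      }
    }
  }

  /** Sums the counts per network; also usable as a combiner. */
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text network, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) total += c.get();
      ctx.write(network, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pageviews-per-network");
    job.setJarByClass(PageViewsPerNetwork.class);
    job.setMapperClass(ViewMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // date-partitioned input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // flat-file results, see next slides
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```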
  13. V1: dashboarding • Results stored in flat files in HDFS • Grouped daily/weekly/monthly • Used gnuplot to build dashboards every hour
  14. What did we learn from V1? • The POC proved the viability of Hadoop • Latency of pulling files was an issue • Most of the metrics computations are of the same nature • People need flexibility in defining what is measured • Once you put data in front of people, they ask more questions • A POC shows which areas are a pain, and where to invest to fix them
  15. V2: changing data ingestion • Use event records instead of log files • Pushed through HTTP • Built using Thrift • Events have • Names • Timestamps • Host • Version • Payloads • Published catalog • All available events • Event parsers • Load ~50 million external page views (~10 events per page)
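Purely for illustration, a plain Java view of the event record described on the slide; the real events were Thrift-defined and pushed over HTTP, so the actual classes would have been Thrift-generated. Field and class names here are assumptions.

```java
import java.util.Collections;
import java.util.Map;

/** Illustrative event record: name, timestamp, host, version, payload. */
public final class AnalyticsEvent {
  private final String name;                    // e.g. "page_view" (hypothetical)
  private final long timestampMs;               // event time, epoch millis
  private final String host;                    // emitting host
  private final int version;                    // schema version, checked against the catalog
  private final Map<String, String> payload;    // opaque key/value payload

  public AnalyticsEvent(String name, long timestampMs, String host,
                        int version, Map<String, String> payload) {
    this.name = name;
    this.timestampMs = timestampMs;
    this.host = host;
    this.version = version;
    this.payload = Collections.unmodifiableMap(payload);
  }

  public String name() { return name; }
  public long timestampMs() { return timestampMs; }
  public String host() { return host; }
  public int version() { return version; }
  public Map<String, String> payload() { return payload; }
}
```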
  16. V2: collectors • Receive events • Put them in a memory queue • Background processes store to local disk • Check events for validity against the catalog • Separate into valid/invalid queues • Another process sucks data into HDFS and organizes it in a directory hierarchy • Events • Grouped by date
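A sketch of the collector loop described above: events land in an in-memory queue, and a background thread validates them against the published catalog and appends them to valid/invalid spools on local disk. The Event, EventCatalog, and EventSpool types are illustrative stand-ins, not the original classes.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class Collector implements Runnable {
  /** Minimal stand-ins for the real event, catalog, and disk-spool types. */
  public interface Event { String name(); int version(); }
  public interface EventCatalog { boolean isKnown(String name, int version); }
  public interface EventSpool { void append(Event e); }

  private final BlockingQueue<Event> incoming = new LinkedBlockingQueue<Event>(100000);
  private final EventCatalog catalog;   // published catalog of all available events
  private final EventSpool valid;       // local-disk queue, later pulled into HDFS
  private final EventSpool invalid;

  public Collector(EventCatalog catalog, EventSpool valid, EventSpool invalid) {
    this.catalog = catalog;
    this.valid = valid;
    this.invalid = invalid;
  }

  /** Called by the HTTP handler; drops the event if the queue is full. */
  public boolean accept(Event e) {
    return incoming.offer(e);
  }

  /** Background drain: validate against the catalog and spool to disk. */
  @Override
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        Event e = incoming.take();
        (catalog.isKnown(e.name(), e.version()) ? valid : invalid).append(e);
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }
}
```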
  17. V2: computation abstraction • Common tasks • Projection • What fields am I interested in? • Filtering • Which records am I interested in? • Aggregations • What do I want to do with the metrics? • Common readers and writers for data types • Captured in libraries that can be composed for complex analytics
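A sketch of how projection, filtering, and aggregation might compose, with made-up interface names; the point is that a metric becomes a pipeline of small, reusable pieces.

```java
import java.util.List;
import java.util.Map;

/** Illustrative pipeline: filter records, project a field, aggregate the values. */
public final class MetricPipeline {
  public interface Filter { boolean keep(Map<String, String> record); }        // which records
  public interface Projection { String project(Map<String, String> record); }  // which fields
  public interface Aggregation { void add(String value); double result(); }    // what to do with them

  private final Filter filter;
  private final Projection projection;
  private final Aggregation aggregation;

  public MetricPipeline(Filter filter, Projection projection, Aggregation aggregation) {
    this.filter = filter;
    this.projection = projection;
    this.aggregation = aggregation;
  }

  public double run(List<Map<String, String>> records) {
    for (Map<String, String> record : records) {
      if (filter.keep(record)) {
        aggregation.add(projection.project(record));
      }
    }
    return aggregation.result();
  }
}
```

A "pageviews per network" metric and an "average response time" metric then differ only in which three pieces get plugged in.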
  18. V2: better dashboards • Metrics summarized in MySQL databases • Interactive dashboards using Ruby/Sinatra • Select metrics • Time range • Aggregation method • Plot results using FusionCharts • OpenCharts was a close second, but no combined charts (histograms, line charts)
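The dashboards themselves were Ruby/Sinatra; to keep the examples here in one language, this Java/JDBC sketch only shows the shape of the summary query such a dashboard runs against MySQL (pick a metric, a time range, and an aggregation). The table and column names (metric_daily, metric, day, value) are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MetricQuery {
  /** Aggregate one metric over a date range; agg is whitelisted before splicing into SQL. */
  public static double aggregate(String jdbcUrl, String metric,
                                 String from, String to, String agg) throws Exception {
    if (!agg.equals("SUM") && !agg.equals("AVG") && !agg.equals("MAX")) {
      throw new IllegalArgumentException("unsupported aggregation: " + agg);
    }
    String sql = "SELECT " + agg + "(value) FROM metric_daily "
               + "WHERE metric = ? AND day BETWEEN ? AND ?";
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setString(1, metric);
      ps.setString(2, from);   // e.g. "2011-09-01"
      ps.setString(3, to);     // e.g. "2011-09-30"
      try (ResultSet rs = ps.executeQuery()) {
        return rs.next() ? rs.getDouble(1) : 0.0;
      }
    }
  }
}
```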
  19. What did we learn from V2? • HDFS I/O is better than the local disk • No need for the process that saves locally and then to HDFS • People loved events • Led to event abuse • Each feature on the page had an associated event • Events were used for performance tuning: how much time did a feature take • Events were used for monitoring backend features: record errors with services • A large number of files causes problems for the namenode • Need to coalesce events to reduce the file count • With flexible event types and interactive dashboards, people have more questions • We couldn’t keep up with developing custom metrics and charts • Needed a self-serve query mechanism
  20. V3: ingestion • Minor modifications • Collectors now write to HDFS • Collectors accumulate events to reduce the file count • Self-serve UI for defining new events outside of the metrics team
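A sketch of the V3 change: the collector batches events in memory and writes each batch straight into HDFS as a single file, instead of spooling to local disk first. The SequenceFile format and the /events/&lt;name&gt;/&lt;timestamp&gt;.seq layout are assumptions, not the original design.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CoalescingHdfsWriter {
  /** Write one accumulated batch of serialized events as a single HDFS file. */
  public static void writeBatch(List<String> serializedEvents, String eventName) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long now = System.currentTimeMillis();
    Path out = new Path("/events/" + eventName + "/" + now + ".seq");
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, LongWritable.class, Text.class);
    try {
      for (String event : serializedEvents) {
        writer.append(new LongWritable(now), new Text(event)); // many events, one file
      }
    } finally {
      writer.close();   // one file per batch keeps the namenode's file count down
    }
  }
}
```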
  21. V3: computation • Need a higher-level query language • JSON API exposing a search-like query syntax • {from: 'date', to: 'date', metric: 'x', computation} • Computations are encapsulated into libraries and exposed through JSON • Users can add metrics and computations and build frontends for the query language • Custom code for ML tasks • Cascading for algorithms • R for visualization
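A sketch of how such a JSON query might be dispatched on the server side: the Query fields follow the syntax on the slide, while the Computation interface and the registry are illustrative stand-ins for the encapsulated computation libraries.

```java
import java.util.HashMap;
import java.util.Map;

public class QueryService {
  /** One registered computation, e.g. a trend or an average (illustrative). */
  public interface Computation { double compute(String metric, String from, String to); }

  /** Parsed form of {from: 'date', to: 'date', metric: 'x', computation: ...}. */
  public static final class Query {
    final String from, to, metric, computation;
    public Query(String from, String to, String metric, String computation) {
      this.from = from; this.to = to; this.metric = metric; this.computation = computation;
    }
  }

  private final Map<String, Computation> registry = new HashMap<String, Computation>();

  /** Library authors register computations; frontends only speak the JSON syntax. */
  public void register(String name, Computation computation) {
    registry.put(name, computation);
  }

  public double execute(Query q) {
    Computation c = registry.get(q.computation);
    if (c == null) {
      throw new IllegalArgumentException("unknown computation: " + q.computation);
    }
    return c.compute(q.metric, q.from, q.to);
  }
}
```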
  22. V3: dashboards • More intermediate data precomputed • Data stored in HBase • Dashboards go against HBase • Templates for users to build custom dashboards
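A sketch of the HBase-backed dashboard path using the 0.90-era client API that would have been current in 2011: batch jobs write one precomputed value per metric and day, and the dashboard does point reads instead of recomputing. The table name, column family, and row-key layout are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MetricStore {
  private static final byte[] FAMILY = Bytes.toBytes("m");
  private static final byte[] VALUE = Bytes.toBytes("v");
  private final HTable table;

  public MetricStore() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    table = new HTable(conf, "dashboard_metrics");   // hypothetical table name
  }

  /** Batch job writes one precomputed value per metric/day. */
  public void write(String metric, String day, double value) throws Exception {
    Put put = new Put(Bytes.toBytes(metric + "|" + day));
    put.add(FAMILY, VALUE, Bytes.toBytes(value));
    table.put(put);
  }

  /** Dashboard reads go straight to HBase rather than re-running jobs. */
  public double read(String metric, String day) throws Exception {
    Result result = table.get(new Get(Bytes.toBytes(metric + "|" + day)));
    byte[] v = result.getValue(FAMILY, VALUE);
    return v == null ? 0.0 : Bytes.toDouble(v);
  }
}
```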
  23. V3: what did we learn? • Self-serve is the way to go • Give people the infrastructure and the support libraries and they’ll go to town • Some tasks still can’t be done in a framework and need custom code • Machine learning, with analysis in R • ML is hard, even with experience • Data is not clean • Some content is very small • Comments on pictures and videos (workarounds for aggregation) • Even then you can build products around the results • People and network recommenders • Network categories for ads
  24. How would we do it differently today? • Open source obviates custom code • Scribe for data ingestion • Hive for self-serve analytics and business intelligence • Pig scripts subsume most of the Java code • Cascading for Java map-reduce • Dashboards still stay the same
  25. Epilogue • ML analysis showed most usage was spam • Shut down a lot of pr0n networks and video-hosting networks in East Asia • Team moved to different companies • Still in analytics at LinkedIn, Facebook, and Twitter • Company changed its business model to for-pay only and laid off half the staff six months later • Company was acquired recently
  26. Takeaway • The problems and solutions are mostly the same everywhere • Getting data into Hadoop • Computing over the data • Getting meaningful data out of Hadoop • Lots of software components exist to help you with these • It is about the balance of what you develop vs. what you acquire
  27. Q&A
  28. The Leader in Big Data Intelligence on Hadoop • www.karmasphere.com • © Karmasphere 2011 All rights reserved