Successfully reported this slideshow.

Atlanta hadoop users group july 2013


Published on

Published in: Technology, Business
  • Be the first to like this

Atlanta hadoop users group july 2013

  1. 1. Our Hadoop Journey Chris Curtin Head ofTechnical Research Atlanta Hadoop Users Group July 2013
  2. 2. About Me • 20+ years in technology • Head ofTechnical Research at Silverpop (12 + years at Silverpop) • Built a SaaS platform before the term ‘SaaS’ was being used • Prior to Silverpop: real-time control systems, factory automation and warehouse management • Always looking for technologies and algorithms to help with our challenges • Car nut 2
  3. 3. Silverpop Open Positions • Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB) • Senior Software Engineer – MIS (.NET stack) • Software Engineer • Software Engineer – Integration Services (PHP, MySQL) • Delivery Manager – Engineering • Technical Lead – Engineering • Technical Project Manager – Integration Services • – Go to Careers under About 3
  4. 4. About Silverpop • Founded in late 1999, Atlanta based, offices in London, Germany, Irvine California • Digital MarketingTechnology provider, unifying marketing automation, email, mobile and social. • Track billions of contact events, execute on those events, send billions of emails • Clients are in marketing departments 4
  5. 5. Challenge from the business • Engage allows clients to define their own database schema for contact records • No two client’s schemas are the same • Schemas often change weekly/monthly • Contact’s records are ‘point in time’ • Users want to report on value of a contact record when activity occurred 5
  6. 6. Example • How well did my marketing campaign to my loyalty clients do last quarter? • Easy question, hard answer – Contact’s ‘level’ changes throughout the year (Silver to Gold) – Some piece of data wasn’t known at the time of the email send, but is now – What do you want to pivot on? Level? Age? Source Code?Time in database? 6
  7. 7. Technical solutions • Traditional Data warehouse • Queries against OLTP or OLAP stores • Customer-specific databases 7
  8. 8. Hadoop • Started working on R&D project in 2008 • First raw map/reduce • Some Pig • Some Hive/Hbase • (and several start-ups long since dead …) • Flexible schema caused problems with most of them 8
  9. 9. First ‘real’ application • Pivot reports against flexible schemas • Per contact, not aggregate • Let the user select any communication(s), see what user attributes are available to use as pivots • Pivot data is at time of communication, not current values (slow moving data) • Could be against a few thousand events, to billions 9
  10. 10. First ‘real’ challenges • Flexible schema meant Hbase, Hive etc. wouldn’t work easily • Flexible schema meant Pig scripts were difficult to maintain (even generating on the fly) • Need to coordinate multiple steps OUTSIDE of the Hadoop process • UI • Resource Allocation and control 10
  11. 11. Cascading • Answered a number of problems • Allowed integration with other platforms, even between M/R jobs – MySQL to find list of supported columns – HDFS to find actual files on disk – JMS for job sourcing/status updates (not implemented) 11
  12. 12. Cascading Dynamic Schema Solution • Allows the definition of schema at run time • Allows definition of steps at run time – One report may have 10 mailings, another 10,000 – 10,000 mailings can’t be run in parallel, so programmatically create temporary results 12
  13. 13. SampleCascadingCode 13
  14. 14. Client Response • Either got it immediately or didn’t see the need for something this flexible • Found a reason to talk to others in organization to find other pivot fields • Most common use case: behaviors based on Source Code • Turned out to be a weekly/monthly report not a day-to-day tool • Some used it for ad hoc, but to build a requirement for their BI teams 14
  15. 15. ProfilingApplication • Retention is a big theme in marketing • Looking at a single mailing/ad buy etc. showed aggregates about that slice of time, but are misleading: – Is the 20% who opened that email the same 20% as last week? – For people in my database for 6 months, how often do they interact with my marketing? – What is a typical interaction rate for my database? – How many times on average does a contact interact with me in a month?Who is outside of that rate? • Instead of looking across communication now needed to look at each contact 15
  16. 16. New technical challenges • Previous report could be broken into specific steps to reduce volume of events before ‘heavy’ math was done • New report needs to look at all events together • Quickly overwhelmed scheduler 16
  17. 17. HadoopChallenges • No schema – external store of mappings • No appending in HDFS – daily integration could be 10MM rows for a communication or 5 • ‘lots of small files’ – thousands of clients with thousands of communications means millions of files • ETL from Oracle meant concatenating files weekly to keep count down • Single point of failure (Name Node) took long time to recover • Non-batch processes, how to schedule jobs on demand? • Hadoop Job History – memory vs. concurrent job tradeoffs 17
  18. 18. MapR • Eventually settled on MapR M3 – Large number of files was main driver – NFS mount is nice feature – Cascading works • Not without issues – Found several bugs aroundVolumes in HDFS and log retention that we had to work around (later fixed) – Can’t copy between volumes using HDFS commands – More complicated for operations to manage (had a CLDB failure that took a day to recover, mostly us trying to figure out what to do.) 18
  19. 19. Misc.Technical Information • Fair Scheduler – Our scheduling logic knows how many queues and controls how many jobs can be submitted at the same time • Mapr ExpressLane is useful for small jobs – Our scheduler knows it is a small job so lets MapR take it • Mapr’s NFS mount is great – Write directly to it from Java apps instead of HDFS API – Concatenating daily files is a simple Java app now – (Still don’t append to files in HDFS, but could) • Nagios for monitoring 19
  20. 20. Cluster details • 5 nodes – 1 admin, 4 workers – 8 core Xeon 16 GB – 5TB usable per box assigned to MapR • Had 9 nodes, reduced to 5 – Cluster was mostly idle due to user’s submittal patterns (heavy on Tuesdays, 7th day of the month) – Delay to end users was minimal when we reduced the number of machines 20
  21. 21. Closing the loop • Next logical step was for clients to ask to target the contacts • The volume of data didn’t make that easily possible • Integrating from Hadoop back to Oracle became an ETL project – Export from Oracle was single dump, import would be a job per client. • Automation of reports (and emailing results) was 2nd most asked for feature • Lots of support required to know what to do with the results – No easy ‘go do this when you see this in the reports’ 21
  22. 22. Current Status • Dozens of monthly users • Some optimizations to toss data early in the import step for clients not using the tool • Packaging and pricing is vexing the product marketing team • Runs lights out unless the ETL process breaks 22
  23. 23. Business Challenges • Lots of cool ideas we came up with, even implemented a few • But end users didn’t know what to do with the data • ‘SaaS-ifying’ is proving difficult – Multi-tenancy resource management is not available – How to price? End report may have 20 rows but processed 1BN rows to get there • If I hear ‘do you do big data’ one more time … 23
  24. 24. Things we are watching • Real-time tools on top of Hadoop (Drill, Impala) • Storm inside ofYARN • Storm in general • Integration of Kafka, Storm, Drill/Impala, Hadoop & MongoDB 24
  25. 25. Information • Slides: • Me: @ChrisCurtin on twitter 25
  26. 26. Silverpop Open Positions • Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB) • Senior Software Engineer – MIS (.NET stack) • Software Engineer • Software Engineer – Integration Services (PHP, MySQL) • Delivery Manager – Engineering • Technical Lead – Engineering • Technical Project Manager – Integration Services • positions.html 26