Chris Curtin discusses Silverpop's journey with Hadoop. They initially used Hadoop to build flexible reports on customer data despite varying schemas. This helped with queries but was difficult to maintain. They then used Cascading to dynamically define schemas and job steps. Next, they profiled customer interactions over time which challenged Hadoop due to many small files and lack of appending. They switched to MapR which helped but recovery remained an issue. Current work includes optimizing imports, packaging the solution, and watching new real-time Hadoop technologies. The main challenges have been helping customers understand and use insights from large and complex data.
2. About Me
• 20+ years in technology
• Head of Technical Research at Silverpop (12+ years at Silverpop)
• Built a SaaS platform before the term ‘SaaS’ was being used
• Prior to Silverpop: real-time control systems, factory automation
and warehouse management
• Always looking for technologies and algorithms to help with our
challenges
• Car nut
3. Silverpop Open Positions
• Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB)
• Senior Software Engineer – MIS (.NET stack)
• Software Engineer
• Software Engineer – Integration Services (PHP, MySQL)
• Delivery Manager – Engineering
• Technical Lead – Engineering
• Technical Project Manager – Integration Services
• http://www.silverpop.com – Go to Careers under About
4. About Silverpop
• Founded in late 1999; Atlanta based, with offices in London, Germany, and Irvine, California
• Digital Marketing Technology provider, unifying marketing automation, email, mobile and social.
• Track billions of contact events, execute on those events, send
billions of emails
• Clients are in marketing departments
5. Challenge from the business
• Engage allows clients to define their own database schema for
contact records
• No two clients’ schemas are the same
• Schemas often change weekly/monthly
• Contact’s records are ‘point in time’
• Users want to report on value of a contact record when activity
occurred
6. Example
• How well did my marketing campaign to my loyalty clients do last
quarter?
• Easy question, hard answer
– Contact’s ‘level’ changes throughout the year (Silver to Gold)
– Some piece of data wasn’t known at the time of the email send, but is
now
– What do you want to pivot on? Level? Age? Source Code? Time in database?
8. Hadoop
• Started working on R&D project in 2008
• First raw map/reduce
• Some Pig
• Some Hive/HBase
• (and several start-ups long since dead …)
• Flexible schema caused problems with most of them
9. First ‘real’ application
• Pivot reports against flexible schemas
• Per contact, not aggregate
• Let the user select any communication(s), see what user attributes
are available to use as pivots
• Pivot data is at time of communication, not current values (slow
moving data)
• Could be against a few thousand events, to billions
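The “at time of communication” requirement above is essentially a floor lookup over each contact’s attribute history. A minimal stdlib sketch of that idea (class and method names are hypothetical, not Silverpop’s actual code), assuming attribute changes are keyed by effective timestamp:

```java
import java.util.Map;
import java.util.TreeMap;

// Point-in-time lookup: given a history of attribute changes for a contact,
// return the value that was in effect at the moment a mailing was sent.
public class PointInTimeLookup {
    // effective timestamp (epoch millis) -> attribute value, e.g. loyalty level
    private final TreeMap<Long, String> history = new TreeMap<>();

    public void record(long effectiveAt, String value) {
        history.put(effectiveAt, value);
    }

    // Value in effect at sendTime: the latest change at or before that instant.
    public String valueAt(long sendTime) {
        Map.Entry<Long, String> e = history.floorEntry(sendTime);
        return e == null ? null : e.getValue();
    }
}
```

With this shape, a contact promoted from Silver to Gold mid-year still pivots as “Silver” for mailings sent before the promotion.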
10. First ‘real’ challenges
• Flexible schema meant HBase, Hive etc. wouldn’t work easily
• Flexible schema meant Pig scripts were difficult to maintain (even
generating on the fly)
• Need to coordinate multiple steps OUTSIDE of the Hadoop
process
• UI
• Resource Allocation and control
11. Cascading
• Answered a number of problems
• Allowed integration with other platforms, even between M/R jobs
– MySQL to find list of supported columns
– HDFS to find actual files on disk
– JMS for job sourcing/status updates (not implemented)
12. Cascading Dynamic Schema Solution
• Allows the definition of schema at run time
• Allows definition of steps at run time
– One report may have 10 mailings, another 10,000
– 10,000 mailings can’t be run in parallel, so programmatically create
temporary results
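The fan-in problem above (10,000 mailings can’t all feed one parallel step) can be pictured as plain batching: split the mailing list into fixed-size groups, compute a temporary result per group, then combine. A hedged stdlib sketch of the grouping step only, not the actual Cascading flow:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of fan-in batching: rather than wiring thousands of mailing sources
// into one step, group them so each group produces an intermediate
// ("temporary") result that a final step combines.
public class BatchPlanner {
    // Split mailing ids into groups of at most batchSize.
    public static List<List<String>> plan(List<String> mailingIds, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < mailingIds.size(); i += batchSize) {
            batches.add(mailingIds.subList(i, Math.min(i + batchSize, mailingIds.size())));
        }
        return batches;
    }
}
```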
14. Client Response
• Clients either got it immediately or didn’t see the need for something this flexible
• Found a reason to talk to others in organization to find other pivot
fields
• Most common use case: behaviors based on Source Code
• Turned out to be a weekly/monthly report, not a day-to-day tool
• Some used it for ad hoc analysis, but mainly to build requirements for their BI teams
15. Profiling Application
• Retention is a big theme in marketing
• Looking at a single mailing/ad buy etc. showed aggregates about that slice of time, but those aggregates can be misleading:
– Is the 20% who opened that email the same 20% as last week?
– For people in my database for 6 months, how often do they interact
with my marketing?
– What is a typical interaction rate for my database?
– How many times on average does a contact interact with me in a month? Who is outside of that rate?
• Instead of looking across communication now needed to look at
each contact
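The per-contact questions above reduce to counting interactions per contact, then comparing each contact’s rate to the database average. A small stdlib sketch of that aggregation (names are illustrative, not the production code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Per-contact profiling sketch: count interactions per contact and compute
// the average interaction count across the database, so outliers (contacts
// far from the average) can be flagged.
public class InteractionProfile {
    // events is a flat stream of contact ids, one per interaction event.
    public static Map<String, Integer> countByContact(List<String> events) {
        Map<String, Integer> counts = new HashMap<>();
        for (String contactId : events) {
            counts.merge(contactId, 1, Integer::sum);
        }
        return counts;
    }

    // Average interactions per contact across the whole database.
    public static double averageRate(Map<String, Integer> counts) {
        if (counts.isEmpty()) return 0.0;
        long total = 0;
        for (int c : counts.values()) total += c;
        return (double) total / counts.size();
    }
}
```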
16. New technical challenges
• Previous report could be broken into specific steps to reduce
volume of events before ‘heavy’ math was done
• New report needs to look at all events together
• Quickly overwhelmed scheduler
17. Hadoop Challenges
• No schema – external store of mappings
• No appending in HDFS – daily integration could be 10MM rows for a communication, or just 5
• ‘lots of small files’ – thousands of clients with thousands of
communications means millions of files
• ETL from Oracle meant concatenating files weekly to keep count
down
• Single point of failure (Name Node) took long time to recover
• Non-batch processes, how to schedule jobs on demand?
• Hadoop Job History – memory vs. concurrent job tradeoffs
18. MapR
• Eventually settled on MapR M3
– Large number of files was main driver
– NFS mount is nice feature
– Cascading works
• Not without issues
– Found several bugs around Volumes in HDFS and log retention that we had to work around (later fixed)
– Can’t copy between volumes using HDFS commands
– More complicated for operations to manage (had a CLDB failure that
took a day to recover, mostly us trying to figure out what to do.)
19. Misc. Technical Information
• Fair Scheduler
– Our scheduling logic knows how many queues and controls how many
jobs can be submitted at the same time
• MapR ExpressLane is useful for small jobs
– Our scheduler knows it is a small job so lets MapR take it
• MapR’s NFS mount is great
– Write directly to it from Java apps instead of the HDFS API
– Concatenating daily files is a simple Java app now
– (Still don’t append to files in HDFS, but could)
• Nagios for monitoring
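As noted above, with the cluster exposed over NFS, concatenating daily files becomes ordinary file I/O rather than HDFS API calls. A minimal sketch of such a concatenation app (paths and names are illustrative):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Concatenate many small daily files into one target file. Against an NFS
// mount this is plain java.nio file I/O; no HDFS client API is involved.
public class FileConcat {
    public static void concat(List<Path> dailyFiles, Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (Path p : dailyFiles) {
                Files.copy(p, out); // stream one day's events onto the end
            }
        }
    }
}
```

Keeping the file count down this way is what made the “lots of small files” problem manageable.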
20. Cluster details
• 5 nodes
– 1 admin, 4 workers
– 8-core Xeon, 16 GB RAM
– 5TB usable per box assigned to MapR
• Had 9 nodes, reduced to 5
– Cluster was mostly idle due to users’ submission patterns (heavy on Tuesdays and the 7th day of the month)
– Delay to end users was minimal when we reduced the number of
machines
21. Closing the loop
• Next logical step was for clients to ask to target the contacts
• The volume of data didn’t make that easily possible
• Integrating from Hadoop back to Oracle became an ETL project
– Export from Oracle was a single dump; import would be a job per client.
• Automation of reports (and emailing results) was the 2nd most asked-for feature
• Lots of support required to know what to do with the results
– No easy ‘go do this when you see this in the reports’
22. Current Status
• Dozens of monthly users
• Some optimizations to toss data early in the import step for clients
not using the tool
• Packaging and pricing is vexing the product marketing team
• Runs lights out unless the ETL process breaks
23. Business Challenges
• Lots of cool ideas we came up with, even implemented a few
• But end users didn’t know what to do with the data
• ‘SaaS-ifying’ is proving difficult
– Multi-tenancy resource management is not available
– How to price? End report may have 20 rows but processed 1BN rows
to get there
• If I hear ‘do you do big data’ one more time …
24. Things we are watching
• Real-time tools on top of Hadoop (Drill, Impala)
• Storm inside of YARN
• Storm in general
• Integration of Kafka, Storm, Drill/Impala, Hadoop & MongoDB