Our Hadoop Journey
Chris Curtin
Head of Technical Research
Atlanta Hadoop Users Group July 2013
About Me
• 20+ years in technology
• Head of Technical Research at Silverpop (12+ years at Silverpop)
• Built a SaaS platform before the term ‘SaaS’ was being used
• Prior to Silverpop: real-time control systems, factory automation
and warehouse management
• Always looking for technologies and algorithms to help with our
challenges
• Car nut
2
Silverpop Open Positions
• Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB)
• Senior Software Engineer – MIS (.NET stack)
• Software Engineer
• Software Engineer – Integration Services (PHP, MySQL)
• Delivery Manager – Engineering
• Technical Lead – Engineering
• Technical Project Manager – Integration Services
• http://www.silverpop.com – Go to Careers under About
3
About Silverpop
• Founded in late 1999, Atlanta based, offices in London, Germany, and
Irvine, California
• Digital Marketing Technology provider, unifying marketing
automation, email, mobile and social.
• Track billions of contact events, execute on those events, send
billions of emails
• Clients are in marketing departments
4
Challenge from the business
• Engage allows clients to define their own database schema for
contact records
• No two clients’ schemas are the same
• Schemas often change weekly/monthly
• Contact records are ‘point in time’
• Users want to report on value of a contact record when activity
occurred
5
Example
• How well did my marketing campaign to my loyalty clients do last
quarter?
• Easy question, hard answer
– Contact’s ‘level’ changes throughout the year (Silver to Gold)
– Some piece of data wasn’t known at the time of the email send, but is
now
– What do you want to pivot on? Level? Age? Source Code? Time in
database?
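A minimal sketch of the ‘point in time’ lookup behind this question, assuming each contact attribute is kept as a date-sorted list of (effective date, value) changes. The names `level_history` and `value_at` are hypothetical and are not part of Engage; the sample dates are made up for illustration.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical change history for one contact's 'level', sorted by effective date.
level_history = [
    (date(2013, 1, 1), "Silver"),
    (date(2013, 5, 15), "Gold"),
]

def value_at(history, when):
    """Return the value in effect on `when`, or None if nothing was known yet."""
    dates = [effective for effective, _ in history]
    idx = bisect_right(dates, when)  # number of changes on or before `when`
    return history[idx - 1][1] if idx else None

# Level in effect when a March email was sent, versus a June email.
march_level = value_at(level_history, date(2013, 3, 1))
june_level = value_at(level_history, date(2013, 6, 1))
print(march_level, june_level)
```

Reporting on the value at send time rather than the current value is what keeps a contact who moved from Silver to Gold mid-year from being counted as Gold for every earlier mailing.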
6
Technical solutions
• Traditional Data warehouse
• Queries against OLTP or OLAP stores
• Customer-specific databases
7
Hadoop
• Started working on R&D project in 2008
• First raw map/reduce
• Some Pig
• Some Hive/HBase
• (and several start-ups long since dead …)
• Flexible schema caused problems with most of them
8
First ‘real’ application
• Pivot reports against flexible schemas
• Per contact, not aggregate
• Let the user select any communication(s), see what user attributes
are available to use as pivots
• Pivot data is at time of communication, not current values (slow
moving data)
• Could be against a few thousand events, to billions
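The report described above can be sketched in a few lines, assuming each event row carries a snapshot of the contact's attributes as they were when the communication was sent. The row layout and the names `available_pivots` and `pivot` are assumptions for illustration only; the production version ran as Hadoop jobs, not in-memory Python.

```python
from collections import Counter

# Hypothetical event rows. The 'attrs' snapshot differs per row because
# client schemas are flexible and change over time.
events = [
    {"mailing": "spring_sale", "action": "open", "attrs": {"level": "Silver", "source": "web"}},
    {"mailing": "spring_sale", "action": "open", "attrs": {"level": "Gold"}},
    {"mailing": "spring_sale", "action": "click", "attrs": {"level": "Gold", "source": "store"}},
    {"mailing": "fall_promo", "action": "open", "attrs": {"level": "Gold"}},
]

def available_pivots(rows, mailings):
    """Attributes present on at least one event of the selected mailings."""
    return sorted({name for r in rows if r["mailing"] in mailings for name in r["attrs"]})

def pivot(rows, mailings, attribute):
    """Count events per (attribute value, action); missing values become '(unknown)'."""
    counts = Counter()
    for r in rows:
        if r["mailing"] in mailings:
            counts[(r["attrs"].get(attribute, "(unknown)"), r["action"])] += 1
    return counts

selected = {"spring_sale"}
print(available_pivots(events, selected))
print(pivot(events, selected, "level"))
```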
9
First ‘real’ challenges
• Flexible schema meant HBase, Hive etc. wouldn’t work easily
• Flexible schema meant Pig scripts were difficult to maintain (even
generating on the fly)
• Need to coordinate multiple steps OUTSIDE of the Hadoop
process
• UI
• Resource Allocation and control
10
Cascading
• Answered a number of problems
• Allowed integration with other platforms, even between M/R jobs
– MySQL to find list of supported columns
– HDFS to find actual files on disk
– JMS for job sourcing/status updates (not implemented)
11
Cascading Dynamic Schema Solution
• Allows the definition of schema at run time
• Allows definition of steps at run time
– One report may have 10 mailings, another 10,000
– 10,000 mailings can’t be run in parallel, so programmatically create
temporary results
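The idea of creating steps and temporary results programmatically can be sketched in plain Python. This is not the Cascading API: `chunked`, `count_events`, the stand-in loader and the batch size of 100 are all assumptions chosen for illustration.

```python
from collections import Counter

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def count_events(mailing_ids, load_events, batch_size=100):
    """Aggregate each batch into a temporary result, then merge the temporaries."""
    temporary_results = []
    for batch in chunked(mailing_ids, batch_size):
        partial = Counter()
        for mailing_id in batch:
            partial.update(load_events(mailing_id))
        temporary_results.append(partial)
    total = Counter()
    for partial in temporary_results:
        total.update(partial)
    return total, len(temporary_results)

def fake_loader(mailing_id):
    """Stand-in for reading a mailing's events: two opens and one click each."""
    return ["open", "open", "click"]

total, steps = count_events(list(range(10_000)), fake_loader, batch_size=100)
print(steps, total["open"], total["click"])
```

The number of intermediate steps depends on the report: a 10-mailing report collapses to a single batch, while a 10,000-mailing report is split into as many batches as the chosen size requires.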
12
Sample Cascading Code
13
Client Response
• Either got it immediately or didn’t see the need for something this
flexible
• Found a reason to talk to others in organization to find other pivot
fields
• Most common use case: behaviors based on Source Code
• Turned out to be a weekly/monthly report not a day-to-day tool
• Some used it for ad hoc, but to build a requirement for their BI
teams
14
Profiling Application
• Retention is a big theme in marketing
• Looking at a single mailing, ad buy, etc. shows aggregates about
that slice of time, but these are misleading:
– Is the 20% who opened that email the same 20% as last week?
– For people in my database for 6 months, how often do they interact
with my marketing?
– What is a typical interaction rate for my database?
– How many times on average does a contact interact with me in a
month? Who is outside of that rate?
• Instead of looking across communications, we now needed to look at
each contact
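A rough sketch of the per-contact view, assuming one month of (contact, action) events and treating "outside of that rate" as more than one standard deviation from the mean. That threshold, the sample data and the name `interaction_profile` are assumptions. Contacts with no events at all do not appear in the input and would need to be added from the contact list.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical events for one month: four light users and one heavy user.
events = (
    [("a", "open")] * 2 + [("b", "open")] * 2 + [("c", "click")] * 2
    + [("d", "open")] * 2 + [("e", "open")] * 12
)

def interaction_profile(rows, tolerance=1.0):
    """Return (average interactions per contact, contacts far from that average)."""
    per_contact = Counter(contact for contact, _ in rows)
    counts = list(per_contact.values())
    avg = mean(counts)
    spread = pstdev(counts)
    outliers = sorted(
        contact for contact, n in per_contact.items()
        if abs(n - avg) > tolerance * spread
    )
    return avg, outliers

average, outside = interaction_profile(events)
print(average, outside)
```

Unlike the pivot report, this needs every event for a contact in one place before any math can be done, which is why the volume cannot be trimmed step by step first.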
15
New technical challenges
• Previous report could be broken into specific steps to reduce
volume of events before ‘heavy’ math was done
• New report needs to look at all events together
• Quickly overwhelmed scheduler
16
Hadoop Challenges
• No schema – external store of mappings
• No appending in HDFS – daily integration could be 10MM rows for
a communication or 5
• ‘lots of small files’ – thousands of clients with thousands of
communications means millions of files
• ETL from Oracle meant concatenating files weekly to keep count
down
• Single point of failure (NameNode) took a long time to recover
• Non-batch processes, how to schedule jobs on demand?
• Hadoop Job History – memory vs. concurrent job tradeoffs
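The periodic concatenation used to keep the file count down might look like the sketch below. It works on ordinary filesystem paths rather than the HDFS API, and the directory layout, file names and the `concatenate` helper are assumptions for illustration.

```python
import shutil
import tempfile
from pathlib import Path

def concatenate(daily_dir, combined_path, pattern="*.csv"):
    """Copy every matching daily file, in name order, into one combined file."""
    daily_files = sorted(Path(daily_dir).glob(pattern))
    with open(combined_path, "wb") as combined:
        for daily in daily_files:
            with open(daily, "rb") as source:
                shutil.copyfileobj(source, combined)
    return len(daily_files)

# Demonstration in a throwaway local directory standing in for the real store.
with tempfile.TemporaryDirectory() as work:
    for day, rows in [("2013-07-01", "a,open\n"), ("2013-07-02", "b,click\n")]:
        Path(work, f"{day}.csv").write_text(rows)
    merged = Path(work, "week.dat")  # different suffix, so it is not re-read as input
    n_files = concatenate(work, merged)
    merged_text = merged.read_text()

print(n_files, repr(merged_text))
```

Because files could not be appended to, each merge rewrites a new combined file rather than extending an existing one; the daily inputs can be deleted once the combined file is verified.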
17
MapR
• Eventually settled on MapR M3
– Large number of files was main driver
– NFS mount is nice feature
– Cascading works
• Not without issues
– Found several bugs around Volumes in HDFS and log retention that
we had to work around (later fixed)
– Can’t copy between volumes using HDFS commands
– More complicated for operations to manage (had a CLDB failure that
took a day to recover, mostly us trying to figure out what to do.)
18
Misc. Technical Information
• Fair Scheduler
– Our scheduling logic knows how many queues and controls how many
jobs can be submitted at the same time
• MapR ExpressLane is useful for small jobs
– Our scheduler knows it is a small job so lets MapR take it
• MapR’s NFS mount is great
– Write directly to it from Java apps instead of HDFS API
– Concatenating daily files is a simple Java app now
– (Still don’t append to files in HDFS, but could)
• Nagios for monitoring
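The submission control described above can be sketched as a small in-process gate that knows the queues and how many jobs each may run at once. It does not call the Fair Scheduler or ExpressLane; the class `JobGate`, the queue names and the one-million-row cutoff for a small job are assumptions.

```python
class JobGate:
    """Holds jobs back so each queue runs at most its configured number at once."""

    def __init__(self, limits, small_job_rows=1_000_000):
        self.limits = limits                  # queue name -> max concurrent jobs
        self.small_job_rows = small_job_rows  # assumed cutoff for a 'small' job
        self.running = {queue: 0 for queue in limits}
        self.waiting = []                     # (job name, rows) in arrival order

    def queue_for(self, rows):
        return "small" if rows <= self.small_job_rows else "large"

    def submit(self, name, rows):
        """Start the job if its queue has room, otherwise park it."""
        queue = self.queue_for(rows)
        if self.running[queue] < self.limits[queue]:
            self.running[queue] += 1
            return ("started", queue)
        self.waiting.append((name, rows))
        return ("waiting", queue)

    def finish(self, queue):
        """Free a slot, then start the oldest waiting job bound for that queue."""
        self.running[queue] -= 1
        for i, (name, rows) in enumerate(self.waiting):
            if self.queue_for(rows) == queue:
                del self.waiting[i]
                self.running[queue] += 1
                return name
        return None

gate = JobGate({"small": 2, "large": 1})
print(gate.submit("report-1", 5_000))
print(gate.submit("report-2", 2_000_000_000))
print(gate.submit("report-3", 3_000_000_000))
```

Keeping the limit outside the cluster means an on-demand report waits in the application's own queue instead of piling up as submitted jobs.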
19
Cluster details
• 5 nodes
– 1 admin, 4 workers
– 8 core Xeon 16 GB
– 5TB usable per box assigned to MapR
• Had 9 nodes, reduced to 5
– Cluster was mostly idle due to users’ submission patterns (heavy on
Tuesdays, 7th day of the month)
– Delay to end users was minimal when we reduced the number of
machines
20
Closing the loop
• Next logical step was for clients to ask to target the contacts
• The volume of data didn’t make that easily possible
• Integrating from Hadoop back to Oracle became an ETL project
– Export from Oracle was single dump, import would be a job per client.
• Automation of reports (and emailing results) was 2nd most asked
for feature
• Lots of support required to know what to do with the results
– No easy ‘go do this when you see this in the reports’
21
Current Status
• Dozens of monthly users
• Some optimizations to toss data early in the import step for clients
not using the tool
• Packaging and pricing is vexing the product marketing team
• Runs lights out unless the ETL process breaks
22
Business Challenges
• Lots of cool ideas we came up with, even implemented a few
• But end users didn’t know what to do with the data
• ‘SaaS-ifying’ is proving difficult
– Multi-tenancy resource management is not available
– How to price? End report may have 20 rows but processed 1BN rows
to get there
• If I hear ‘do you do big data’ one more time …
23
Things we are watching
• Real-time tools on top of Hadoop (Drill, Impala)
• Storm inside of YARN
• Storm in general
• Integration of Kafka, Storm, Drill/Impala, Hadoop & MongoDB
24
Information
• Slides: http://www.slideshare.net/chriscurtin
• Me: ccurtin@silverpop.com @ChrisCurtin on twitter
25
Silverpop Open Positions
• Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB)
• Senior Software Engineer – MIS (.NET stack)
• Software Engineer
• Software Engineer – Integration Services (PHP, MySQL)
• Delivery Manager – Engineering
• Technical Lead – Engineering
• Technical Project Manager – Integration Services
• http://www.silverpop.com/marketing-company/careers/open-positions.html
26

Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

Atlanta Hadoop Users Group July 2013

  • 1. Our Hadoop Journey Chris Curtin Head of Technical Research Atlanta Hadoop Users Group July 2013
  • 2. About Me • 20+ years in technology • Head of Technical Research at Silverpop (12+ years at Silverpop) • Built a SaaS platform before the term ‘SaaS’ was being used • Prior to Silverpop: real-time control systems, factory automation and warehouse management • Always looking for technologies and algorithms to help with our challenges • Car nut
  • 3. Silverpop Open Positions • Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB) • Senior Software Engineer – MIS (.NET stack) • Software Engineer • Software Engineer – Integration Services (PHP, MySQL) • Delivery Manager – Engineering • Technical Lead – Engineering • Technical Project Manager – Integration Services • http://www.silverpop.com – Go to Careers under About
  • 4. About Silverpop • Founded in late 1999, Atlanta based, offices in London, Germany, Irvine, California • Digital Marketing Technology provider, unifying marketing automation, email, mobile and social • Track billions of contact events, execute on those events, send billions of emails • Clients are in marketing departments
  • 5. Challenge from the business • Engage allows clients to define their own database schema for contact records • No two clients’ schemas are the same • Schemas often change weekly/monthly • Contacts’ records are ‘point in time’ • Users want to report on the value of a contact record when the activity occurred
  • 6. Example • How well did my marketing campaign to my loyalty clients do last quarter? • Easy question, hard answer – Contact’s ‘level’ changes throughout the year (Silver to Gold) – Some piece of data wasn’t known at the time of the email send, but is now – What do you want to pivot on? Level? Age? Source Code? Time in database?
  • 7. Technical solutions • Traditional Data warehouse • Queries against OLTP or OLAP stores • Customer-specific databases
  • 8. Hadoop • Started working on R&D project in 2008 • First raw map/reduce • Some Pig • Some Hive/HBase • (and several start-ups long since dead …) • Flexible schema caused problems with most of them
  • 9. First ‘real’ application • Pivot reports against flexible schemas • Per contact, not aggregate • Let the user select any communication(s), see what user attributes are available to use as pivots • Pivot data is at time of communication, not current values (slow moving data) • Could be against anything from a few thousand events to billions
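The "pivot data is at time of communication" requirement on slide 9 amounts to an as-of lookup: for each event, find the attribute value that was in effect when the event happened. The slides do not show how this was implemented, so the following is only a minimal sketch. The function names, the `(effective_from, value)` history layout and the sample data are all assumptions made for illustration.

```python
from bisect import bisect_right


def value_at(history, event_time):
    """Return the attribute value in effect at event_time.

    history is a list of (effective_from, value) tuples sorted by time
    (a hypothetical layout). Returns None when the attribute was not
    yet known at event_time.
    """
    times = [t for t, _ in history]
    i = bisect_right(times, event_time)
    return history[i - 1][1] if i else None


def pivot_counts(events, histories, attribute):
    """Count events grouped by the attribute value at the time of each event."""
    counts = {}
    for contact_id, event_time in events:
        history = histories.get(contact_id, {}).get(attribute, [])
        key = value_at(history, event_time)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical data: contact 1 moved from Silver to Gold on day 200;
# contact 2's level was unknown before day 50.
histories = {
    1: {"level": [(0, "Silver"), (200, "Gold")]},
    2: {"level": [(50, "Silver")]},
}
events = [(1, 100), (1, 250), (2, 10), (2, 60)]
result = pivot_counts(events, histories, "level")
```

Here the event on day 10 for contact 2 lands in the `None` bucket, which mirrors the slide 6 case where a piece of data was not known at the time of the send.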
  • 10. First ‘real’ challenges • Flexible schema meant HBase, Hive etc. wouldn’t work easily • Flexible schema meant Pig scripts were difficult to maintain (even generating on the fly) • Need to coordinate multiple steps OUTSIDE of the Hadoop process • UI • Resource Allocation and control
  • 11. Cascading • Answered a number of problems • Allowed integration with other platforms, even between M/R jobs – MySQL to find list of supported columns – HDFS to find actual files on disk – JMS for job sourcing/status updates (not implemented)
  • 12. Cascading Dynamic Schema Solution • Allows the definition of schema at run time • Allows definition of steps at run time – One report may have 10 mailings, another 10,000 – 10,000 mailings can’t be run in parallel, so programmatically create temporary results
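Slide 12's idea of defining steps at run time and creating temporary results can be illustrated without Cascading itself. The sketch below is plain Python, not the Cascading API: it splits an arbitrary list of mailings into batches, aggregates each batch into a temporary result, then merges the temporary results. The batch size, the `load_events` callback and the fake data are assumptions for illustration only.

```python
from collections import Counter


def chunks(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_report(mailing_ids, load_events, batch_size=100):
    """Aggregate each batch into a temporary result, then merge them.

    load_events(mailing_id) is a hypothetical callback returning the
    pivot values seen for that mailing.
    """
    temporary_results = []
    for batch in chunks(mailing_ids, batch_size):
        partial = Counter()
        for mailing_id in batch:
            partial.update(load_events(mailing_id))
        temporary_results.append(partial)

    total = Counter()
    for partial in temporary_results:
        total.update(partial)  # adds counts when given a mapping
    return total, len(temporary_results)


def fake_load(mailing_id):
    # Stand-in for reading one mailing's events from disk.
    return ["even"] if mailing_id % 2 == 0 else ["odd"]


total, step_count = run_report(list(range(10)), fake_load, batch_size=4)
```

The number of steps depends only on the input size, so a report with 10 mailings and one with 10,000 share the same code path.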
  • 14. Client Response • Either got it immediately or didn’t see the need for something this flexible • Found a reason to talk to others in organization to find other pivot fields • Most common use case: behaviors based on Source Code • Turned out to be a weekly/monthly report, not a day-to-day tool • Some used it for ad hoc analysis, but as a way to build a requirement for their BI teams
  • 15. Profiling Application • Retention is a big theme in marketing • Looking at a single mailing/ad buy etc. shows aggregates about that slice of time, but these are misleading: – Is the 20% who opened that email the same 20% as last week? – For people in my database for 6 months, how often do they interact with my marketing? – What is a typical interaction rate for my database? – How many times on average does a contact interact with me in a month? Who is outside of that rate? • Instead of looking across communications, we now needed to look at each contact
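The questions on slide 15 are all per-contact calculations. A small sketch of three of them follows: interactions per contact, how many of this week's openers also opened last week, and which contacts sit far from the typical rate. The function names and the standard-deviation threshold used to define "outside of that rate" are assumptions; the slides do not say how the rate or its outliers were defined.

```python
from collections import Counter
from statistics import mean, pstdev


def interactions_per_contact(events):
    """Count events per contact; events are (contact_id, event_type) pairs."""
    return Counter(contact_id for contact_id, _ in events)


def repeat_share(openers_this_week, openers_last_week):
    """Fraction of this week's openers who also opened last week."""
    if not openers_this_week:
        return 0.0
    overlap = openers_this_week & openers_last_week
    return len(overlap) / len(openers_this_week)


def outliers(counts, threshold=2.0):
    """Contacts whose count is more than threshold std devs from the mean."""
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return sorted(c for c, n in counts.items() if abs(n - mu) > threshold * sigma)


# Hypothetical month of events: five quiet contacts and one very active one.
events = [(c, "open") for c in "abcde"] + [("f", "click")] * 11
counts = interactions_per_contact(events)
unusual = outliers(counts)
share = repeat_share({1, 2, 3, 4}, {3, 4, 5})
```

Contacts with no events at all never appear in `counts`, so a real report would need the full contact list to include them in the average.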
  • 16. New technical challenges • Previous report could be broken into specific steps to reduce volume of events before ‘heavy’ math was done • New report needs to look at all events together • Quickly overwhelmed the scheduler
  • 17. Hadoop Challenges • No schema – external store of mappings • No appending in HDFS – daily integration could be 10MM rows for a communication or 5 • ‘lots of small files’ – thousands of clients with thousands of communications means millions of files • ETL from Oracle meant concatenating files weekly to keep count down • Single point of failure (Name Node) took long time to recover • Non-batch processes, how to schedule jobs on demand? • Hadoop Job History – memory vs. concurrent job tradeoffs
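The weekly concatenation mentioned on slide 17 is a simple way to keep the file count down when files cannot be appended to. The sketch below merges a directory of daily files into one weekly file using ordinary file-system calls. It assumes the files are reachable through a normal path (as with the NFS mount described on slide 19), and the file names and the `*.csv` pattern are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate(daily_dir, weekly_file, pattern="*.csv"):
    """Write every daily file matching pattern into weekly_file, in name order.

    Returns the number of daily files merged.
    """
    daily_dir = Path(daily_dir)
    weekly_file = Path(weekly_file)
    # Collect sources before opening the output so it is never its own input.
    sources = sorted(p for p in daily_dir.glob(pattern) if p != weekly_file)
    with open(weekly_file, "wb") as out:
        for source in sources:
            with open(source, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(sources)


# Demo with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "2013-07-01.csv").write_text("row1\n")
    (tmp / "2013-07-02.csv").write_text("row2\nrow3\n")
    merged_count = concatenate(tmp, tmp / "week27.dat")
    merged_text = (tmp / "week27.dat").read_text()
```

A production version would also remove or archive the daily files once the merged file is verified; that step is left out here.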
  • 18. MapR • Eventually settled on MapR M3 – Large number of files was main driver – NFS mount is nice feature – Cascading works • Not without issues – Found several bugs around Volumes in HDFS and log retention that we had to work around (later fixed) – Can’t copy between volumes using HDFS commands – More complicated for operations to manage (had a CLDB failure that took a day to recover, mostly us trying to figure out what to do)
  • 19. Misc. Technical Information • Fair Scheduler – Our scheduling logic knows how many queues and controls how many jobs can be submitted at the same time • MapR ExpressLane is useful for small jobs – Our scheduler knows it is a small job so lets MapR take it • MapR’s NFS mount is great – Write directly to it from Java apps instead of HDFS API – Concatenating daily files is a simple Java app now – (Still don’t append to files in HDFS, but could) • Nagios for monitoring
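Slide 19 describes scheduling logic that sits in front of Hadoop, caps how many jobs are submitted at once, and lets small jobs go straight through. The slides give no detail on how that logic works, so the class below is only one possible shape for it. The `JobGate` name, the row-count test for "small" and the return values are all assumptions, and nothing here calls the Fair Scheduler or MapR.

```python
from collections import deque


class JobGate:
    """Hold large jobs back until a slot is free; small jobs bypass the cap."""

    def __init__(self, max_running, small_job_rows):
        self.max_running = max_running        # cap on concurrently submitted jobs
        self.small_job_rows = small_job_rows  # at or below this, a job is 'small'
        self.running = set()
        self.waiting = deque()

    def submit(self, job_id, estimated_rows):
        """Return 'express', 'submitted' or 'waiting' for the new job."""
        if estimated_rows <= self.small_job_rows:
            # Small jobs are handed over immediately and not counted.
            return "express"
        if len(self.running) < self.max_running:
            self.running.add(job_id)
            return "submitted"
        self.waiting.append(job_id)
        return "waiting"

    def finished(self, job_id):
        """Free a slot and return the next waiting job that was started, if any."""
        self.running.discard(job_id)
        if self.waiting and len(self.running) < self.max_running:
            next_job = self.waiting.popleft()
            self.running.add(next_job)
            return next_job
        return None


gate = JobGate(max_running=2, small_job_rows=1000)
states = [
    gate.submit("a", 10),
    gate.submit("b", 5000),
    gate.submit("c", 8000),
    gate.submit("d", 9000),
]
promoted = gate.finished("b")
```

Keeping the cap outside the cluster means bursts of submissions queue up in the application rather than in the Hadoop scheduler.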
  • 20. Cluster details • 5 nodes – 1 admin, 4 workers – 8-core Xeon, 16 GB – 5TB usable per box assigned to MapR • Had 9 nodes, reduced to 5 – Cluster was mostly idle due to users’ submission patterns (heavy on Tuesdays, 7th day of the month) – Delay to end users was minimal when we reduced the number of machines
  • 21. Closing the loop • Next logical step was for clients to ask to target the contacts • The volume of data didn’t make that easily possible • Integrating from Hadoop back to Oracle became an ETL project – Export from Oracle was single dump, import would be a job per client • Automation of reports (and emailing results) was 2nd most asked for feature • Lots of support required to know what to do with the results – No easy ‘go do this when you see this in the reports’
  • 22. Current Status • Dozens of monthly users • Some optimizations to toss data early in the import step for clients not using the tool • Packaging and pricing is vexing the product marketing team • Runs lights out unless the ETL process breaks
  • 23. Business Challenges • Lots of cool ideas we came up with, even implemented a few • But end users didn’t know what to do with the data • ‘SaaS-ifying’ is proving difficult – Multi-tenancy resource management is not available – How to price? End report may have 20 rows but processed 1BN rows to get there • If I hear ‘do you do big data’ one more time …
  • 24. Things we are watching • Real-time tools on top of Hadoop (Drill, Impala) • Storm inside of YARN • Storm in general • Integration of Kafka, Storm, Drill/Impala, Hadoop & MongoDB
  • 25. Information • Slides: http://www.slideshare.net/chriscurtin • Me: ccurtin@silverpop.com @ChrisCurtin on Twitter
  • 26. Silverpop Open Positions • Senior Software Engineer (Java, Oracle, Spring, Hibernate, MongoDB) • Senior Software Engineer – MIS (.NET stack) • Software Engineer • Software Engineer – Integration Services (PHP, MySQL) • Delivery Manager – Engineering • Technical Lead – Engineering • Technical Project Manager – Integration Services • http://www.silverpop.com/marketing-company/careers/open-positions.html