Big Data, Baby Steps
“What Every Leader Should Consider When Starting a Big Data Initiative”
April 12, 2014
Goal for this presentation
“Big data is like teenage sex: everyone talks about it,
nobody really knows how to do it, everyone thinks everyone
else is doing it, so everyone claims they are doing it...”
- Dan Ariely on Facebook Jan 6, 2013 and “others”
Why Me? Why Ancestry?
• Established consumer-facing Web company looking to
leverage our data
• Started with Hadoop and HBase in 2012 on AncestryDNA
• When we started, I looked for guidance – it was missing
• Learn from us: what worked, what didn’t, how to adjust
Agenda
• What to consider before you start
• Understand the Hadoop ecosystem
– What pieces is Ancestry using and why?
– Big Data architecture at Ancestry
• Hadoop distributions
• Big Data consultants
• How to build your team(s)
• Custom logs
– Other companies and Ancestry specifics
• Top three things to remember
Gartner new technology hype cycles
Where is Big Data (and Big Data Analytics) on this curve?
Source: Gartner August 2013
What to consider before you start
• Big Data, Business Intelligence, and Analytics are tied
– Analytics is an umbrella term that represents the entire
ecosystem needed to turn data into actions
• Understand your “data”
– Web click stream data, sales transactions, advertising data,
fraud detection, sensor data, social data, etc.
• Visualize your final goal and work backwards
– Imagine (prototype) the dashboards, analytics, and actions
that will be available
• Deliver value to the business at each step
– “Goal of analytics is not to produce actionable insights; the
goal is to produce results.” Ken Rudin
Understand the Hadoop ecosystem
• Hadoop 2.0 (YARN) and HDFS
• Workflow
• NoSQL
• Data Organization
• Log collection
• Near Real-Time Stream Processing
• NFS File System on HDFS
What are the pieces Ancestry is using?
We use or plan to use:
YARN and Ambari
Forensics on log data:
Visualization:
(Graphs + Deep Zoom)
Visualization
Company that used traditional “Cubes” and Excel
– Business Intelligence/Data Warehouse world has moved
beyond cubes
– Great product that didn’t work for us
– People went back to using Excel
– In two weeks, 30 people created 120+ dashboards and reports
– Tied to an MPP Data Warehouse is changing our company
– Created the “Wild, wild, west” - fixing with a blessed portal
Hadoop distributions
• Open Source, Active Community, Large Eco-System of
Projects, requires more internal knowledge and support
• First Distribution, Large “War Chest” (Cash Investment),
Impala, and the Cloudera Console
• Custom file system (API equivalent to HDFS) that improves
performance, custom HBase implementation, High
Availability Features
• Closest to Apache Hadoop, tested on Yahoo!’s 7000 node
cluster before being released.
• Several Cloud options: Google and Amazon. Quick and
easy to get going. Great way to experiment and learn.
Watch your data storage costs
Typical Big Data architecture
[Diagram. Labeled components: User Facing Stacks and Services; Log Forwarder; Kafka Producer; Kafka; Stream Repo; Samza Stream Processing (Stream A, Stream B, Stream C); Hadoop (System of Records); Raw Data; Simple ETL; MapReduce ETL; EDW (MPP); Simple ELT; Cassandra Repo; User Properties; User Segments; Rules; Global Properties & Models; Marketing Segmentation and Targeting Management; Actions; Feeds. Edge labels: Defines, Designs, Runs on, Expose to the Web Site.]
Ancestry system diagram
[Diagram. Labeled components: .Net Stack; Java Stack; JVM stack; Vert.X stack; Node.js Stack; Python Stack; Aspect; Kafka Producer; Kafka Log Forwarder; Stream Kafka; ETL Kafka; Mirror; Samza Stream Processing (Stream A, Stream B, Stream C); Notification Service; Hadoop (System of Records); Production Hadoop; MapReduce ETL; Aggr ETL; ELT; Dogwood; EDW (ParAccel); Tableau; Elastic Search; Kibana; Actions; Feeds. Initiatives shown: User 360 Services Initiative; Splunk Alternative Initiative; Operation Monitoring Reporting Initiative.]
How to organize and build your team(s)
• Hiring vs. training smart developers in your organization
– Training
▫ Self-starters who can train themselves
▫ Online training that is free or with minimal cost
▫ Paid training for specific technologies
– Promote your technology and people will reach out to you
▫ Bit of a chicken and egg problem
• Key roles for the team
– Developers who understand operations
– Hadoop engineers
– Team leaders and managers
Big Data consultants
• Lots of them, charging lots of money
• Not all of them are created equal
• Prefer consultants who are vendor agnostic
• Find consultants who have experience in what you want
to do
• Check references
Companies working with custom logs
• Scribe, Scuba, Hive, and Hadoop as the data
warehouse infrastructure. Run over 10K Hive
scripts daily to crunch log data. Analyst on
each team to make sure logging is correct.
• Uses a very simple interface similar to log4j to
log data. How to keep this accurate?
• Tried Scribe. Implemented Kafka and Avro to
collect log data. Use a binary format with a
schema registry.
• Recently open sourced their log collecting
infrastructure (Suro – Data Pipeline).
“Used to be a web site that occasionally logged data. Now
we’re a logging engine that occasionally serves as a web
site.”
Collecting custom logs at Ancestry
• Framework piece with a “Logging Aspect”
– Logging is a cross cutting concern
– Avoid breaking changes
– Annotations for parameter names (normalization layer)
• Defined Big Data headers that must be present in every
log (User ID, Anonymous ID, Session, Request ID, Client)
– Stitch data together
– Partitioned in Hive by day/month/year
– JSON payload
– Validate messages sent vs. messages received
– Schema repository (long-term)
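A minimal sketch of the logging-aspect idea, assuming a Python service rather than the stacks Ancestry actually uses. The decorator `logged`, the helper `build_log_record`, the header key spellings, and the `view_record` function are all hypothetical names chosen for illustration; only the five required headers and the JSON payload come from the slide.

```python
import functools
import inspect
import json
import time

# The five headers the slide requires in every log; the key spellings are assumed.
REQUIRED_HEADERS = ("user_id", "anonymous_id", "session", "request_id", "client")


def build_log_record(headers, payload):
    """Return one JSON log line, refusing records that lack a required header."""
    missing = [name for name in REQUIRED_HEADERS if name not in headers]
    if missing:
        raise ValueError(f"missing Big Data headers: {missing}")
    record = {name: headers[name] for name in REQUIRED_HEADERS}
    record["timestamp"] = time.time()
    record["payload"] = payload
    return json.dumps(record, sort_keys=True)


def logged(sink):
    """Decorator standing in for the logging aspect (hypothetical helper)."""
    def decorate(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(headers, *args, **kwargs):
            bound = signature.bind(headers, *args, **kwargs)
            bound.apply_defaults()
            # Parameter names come from the signature, which acts as a small
            # normalization layer; the first argument (headers) is skipped.
            params = dict(list(bound.arguments.items())[1:])
            payload = {"event": func.__name__, "params": params}
            sink.append(build_log_record(headers, payload))
            return func(headers, *args, **kwargs)
        return wrapper
    return decorate


log_lines = []  # stands in for the rolling log file


@logged(log_lines)
def view_record(headers, record_id, collection="census"):
    """Hypothetical business function; it contains no logging code of its own."""
    return f"record {record_id} from {collection}"


example_headers = {
    "user_id": "u1", "anonymous_id": "a1", "session": "s1",
    "request_id": "r1", "client": "web",
}
view_record(example_headers, 42)
print(log_lines[-1])
```

Keeping the header check in one place means a new service cannot emit a log that is impossible to stitch later, and the business function stays free of logging code.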
Stitching data together
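One way to picture stitching, as a rough sketch only: group log records by a shared header and order each group by time. The function name `stitch_by_request`, the field names, and the sample records are assumptions for illustration, not Ancestry's implementation.

```python
from collections import defaultdict


def stitch_by_request(records):
    """Group log records by request_id, each group ordered by timestamp."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["request_id"]].append(record)
    return {
        request_id: sorted(group, key=lambda r: r["timestamp"])
        for request_id, group in grouped.items()
    }


# Hypothetical records arriving out of order from different services.
sample_records = [
    {"request_id": "r1", "timestamp": 2.0, "event": "view"},
    {"request_id": "r2", "timestamp": 1.5, "event": "login"},
    {"request_id": "r1", "timestamp": 1.0, "event": "search"},
]
stitched = stitch_by_request(sample_records)
```

The same grouping works with Session or User ID as the key, which is why those headers have to be present in every log.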
Ancestry log collection details
Each server
• 10 rolling logs
• Scraper process
Validate your data
collection infrastructure
• Auto incrementing count in every log
message
• Count on Framework side (sender) and
count on Hadoop (receiver)
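The sender/receiver count check can be sketched as follows. This is an illustration under assumptions: the class `LogSender`, the function `missing_sequences`, and the `seq` field name are hypothetical, and the real comparison would run against data landed in Hadoop rather than an in-memory list.

```python
import itertools


class LogSender:
    """Stamps each outgoing message with an auto-incrementing count."""

    def __init__(self, server):
        self.server = server
        self._counter = itertools.count(1)
        self.sent = 0  # highest count issued so far

    def stamp(self, message):
        seq = next(self._counter)
        self.sent = seq
        return {"server": self.server, "seq": seq, "message": message}


def missing_sequences(received, server, sent_count):
    """Return the counts the sender issued that never reached the receiver."""
    seen = {r["seq"] for r in received if r["server"] == server}
    return [n for n in range(1, sent_count + 1) if n not in seen]


sender = LogSender("web-01")  # hypothetical server name
stamped = [sender.stamp(f"event {i}") for i in range(5)]
# Simulate a collection pipeline that silently drops the third message.
received = [m for m in stamped if m["seq"] != 3]
lost = missing_sequences(received, "web-01", sender.sent)
```

Comparing counts per server, rather than only totals, shows which machine or hop is losing messages.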
[Diagram. Labeled components: Single Server; Local Server Hard Drive; 10 rolling files; Scraper; Kafka; Hadoop; Log Sender; Log Receiver.]
Ancestry moving forward
• Ancestry is not “done” - the journey continues
– Still evolving and changing
– My thinking and understanding have also changed
• Means we will embrace new technologies in the future
– Keep our eyes open and experiment
• This is affecting the entire organization
– Becoming more involved with Open Source and the
communities that support it
Top three things to remember
• First and foremost, understand your needs
– No clear right or wrong way
– Keep it simple because simple scales
• This is about Analytics and impacting the business
• Find a company that fits you and follow them:
– Netflix (cloud architecture, code for survival, simian army)
– Facebook (HBase)
– LinkedIn (Kafka, Samza, Azkaban)
Bill’s contact information
byetman@ancestry.com
http://blogs.ancestry.com/techroots/
(Filter on Big Data or search for “Adventures in Big Data” in the title)
Bill Yetman
VP of Engineering at Ancestry

The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 

Utah Big Mountain Big Data Baby Steps (4-12-2014) Final

  • 1. Big Data, Baby Steps “What Every Leader Should Consider When Starting a Big Data Initiative” April 12, 2014
  • 2. Goal for this presentation “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...” - Dan Ariely on Facebook Jan 6, 2013 and “others” Why Me? Why Ancestry? • Established consumer facing Web company looking to leverage our data • Started with Hadoop and HBase in 2012 on AncestryDNA • When we started, I looked for guidance – it was missing • Learn from us: what works, what didn’t, how to adjust
  • 3. Agenda • What to consider before you start • Understand the Hadoop ecosystem – What pieces is Ancestry using and why? – Big Data architecture at Ancestry • Hadoop distributions • Big Data consultants • How to build your team(s) • Custom logs – Other companies and Ancestry specifics • Top three things to remember
  • 4. Gartner new technology hype cycles Where is Big Data (and Big Data Analytics) on this curve? Source: Gartner August 2013
  • 5. What to consider before you start • Big Data, Business Intelligence, and Analytics are tied – Analytics is an umbrella term that represents the entire ecosystem needed to turn data into actions • Understand your “data” – Web click stream data, sales transactions, advertising data, fraud detection, sensor data, social data, etc. • Visualize your final goal and work backwards – Imagine (prototype) the dashboards, analytics, and actions that will be available • Deliver value to the business at each step – “Goal of analytics is not to produce actionable insights; the goal is to produce results.” Ken Rudin
  • 6. Understand the Hadoop ecosystem • Hadoop 2.0 and HDFS (Yarn) • Workflow • NoSQL • Data Organization • Log collection • Near Real-Time Stream Processing • NFS File System on HDFS
  • 7. What are the pieces Ancestry is using? We use or plan to use: Yarn and Ambari Forensics on log data: Visualization: (Graphs + Deep Zoom)
  • 8. Visualization Company that used traditional “Cubes” and Excel – Business Intelligence/Data Warehouse world has moved beyond cubes – Great product that didn’t work for us – People went back to using Excel – In two weeks, 30 people created 120+ dashboards and reports – Tied to an MPP Data Warehouse is changing our company – Created the “Wild, wild, west” - fixing with a blessed portal
  • 9. Hadoop distributions • Open Source, Active Community, Large Eco-System of Projects, requires more internal knowledge and support • First Distribution, Large “War Chest” (Cash Investment), Impala, and the Cloudera Console • Custom file system (API equivalent to HDFS) that improves performance, custom HBase implementation, High Availability Features • Closest to Apache Hadoop, tested on Yahoo!’s 7000 node cluster before being released. • Several Cloud options: Google and Amazon. Quick and easy to get going. Great way to experiment and learn. Watch your data storage costs
  • 10. Typical Big Data architecture Cassandra Repo Users Properties User Properties User Segments Rules Defines Samza Stream Processing Stream A Stream B Stream C Kafka Stream Repo Runs on Hadoop System of Records Simple ETL Raw Data Global Properties & Models Marketing Segmentation and Targeting Management Expose to the Web Site User Facing Stacks and Services Log Forwarder Kafka Producer EDW (MPP) Simple ELT MapReduce ETL Designs Actions Feeds
  • 11. Ancestry system diagram Hadoop System of Records Dogwood ELT User 360 Services Initiative Kafka Log Forwarder EDW ParAccel MapReduce ETL Splunk Alternative Initiative Operation Monitoring Reporting Initiative Stream Kafka Samza Stream Processing Stream A Stream B Stream C Notification Service Mirror .Net Stack Java Stack JVM stack Vert.X stack Node.js Stack Python Stack Kafka Producer Aspect Aspect Aggr ETL ETL Kafka Actions Feeds Production Hadoop Tableau Elastic Search Kibana
  • 12. How to organize and build your team(s) • Hiring vs. training smart developers in your organization – Training ▫ Self-starters who can train themselves ▫ Online training that is free or with minimal cost ▫ Paid training for specific technologies – Promote your technology and people will reach out to you ▫ Bit of a chicken and egg problem • Key roles for the team – Developers who understand operations – Hadoop engineers – Team leaders and managers
  • 13. Big Data consultants • Lots of them, charging lots of money • Not all of them are created equal • Prefer consultants who are vendor agnostic • Find consultants who have experience in what you want to do • Check references
  • 14. Companies working with custom logs • Scribe, Scuba, Hive, and Hadoop as the data warehouse infrastructure. Run over 10K Hive scripts daily to crunch log data. Analyst on each team to make sure logging is correct. • Uses a very simple interface similar to log4j to log data. How to keep this accurate? • Tried Scribe. Implemented Kafka and Avro to collect log data. Use a binary format with a schema registry. • Recently open sourced their log collecting infrastructure (Suro – Data Pipeline). “Used to be a web site that occasionally logged data. Now we’re a logging engine that occasionally serves as a web site.”
  • 15. Collecting custom logs at Ancestry • Framework piece with a “Logging Aspect” – Logging is a cross cutting concern – Avoid breaking changes – Annotations for parameter names (normalization layer) • Defined Big Data headers that must be present in every log (User ID, Anonymous ID, Session, Request ID, Client) – Stitch data together – Partitioned in Hive by day/month/year – JSON payload – Validate messages sent vs. messages received – Schema repository (long-term)
  • 17. Ancestry log collection details Each server • 10 rolling logs • Scraper process Validate your data collection infrastructure • Auto-incrementing count in every log message • Count on Framework side (sender) and count on Hadoop (receiver) Local Server Hard Drive Single Server Kafka Scraper 10 rolling files Hadoop Log Sender Log Receiver
  • 18. Ancestry moving forward • Ancestry is not “done” - the journey continues – Still evolving and changing – My thinking and understanding have also changed • Means we will embrace new technologies in the future – Keep our eyes open and experiment • This is affecting the entire organization – Becoming more involved with Open Source and the communities that support it
  • 19. Top three things to remember • First and foremost, understand your needs – No clear right or wrong way – Keep it simple because simple scales • This is about Analytics and impacting the business • Find a company that fits you and follow them: – Netflix (cloud architecture, code for survival, simian army) – Facebook (HBase) – LinkedIn (Kafka, Samza, Azkaban)
  • 20. byetman@ancestry.com http://blogs.ancestry.com/techroots/ (Filter on Big Data or search for “Adventures in Big Data” in the title) Bill’s contact information Bill Yetman VP of Engineering at Ancestry
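
Slides 15 and 17 describe three mechanics that are easier to see in code: required headers on every log message, a JSON payload partitioned by date in Hive, and an auto-incrementing count that lets the sender and receiver sides be reconciled. The following is a minimal sketch of those ideas in Python, not Ancestry's implementation. The class and function names, the snake_case header keys, the `seq`, `server`, and `timestamp` fields, and the `year=/month=/day=` directory layout are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone

# Headers the slides say must be present in every log message.
# The snake_case key names are assumptions for this sketch.
REQUIRED_HEADERS = ("user_id", "anonymous_id", "session", "request_id", "client")


class LogSender:
    """Builds JSON log lines and stamps each with an auto-incrementing count."""

    def __init__(self, server):
        self.server = server
        self.sent = 0  # sender-side count of messages built so far

    def build(self, headers, payload, now=None):
        # Reject the message before counting it if a required header is absent.
        missing = [h for h in REQUIRED_HEADERS if h not in headers]
        if missing:
            raise ValueError(f"missing required headers: {missing}")
        self.sent += 1
        now = now or datetime.now(timezone.utc)
        record = dict(headers)
        record.update(
            server=self.server,          # hypothetical field: which host sent it
            seq=self.sent,               # hypothetical field: the running count
            timestamp=now.isoformat(),
            payload=payload,
        )
        return json.dumps(record)


def partition_path(line):
    """Return a Hive-style partition directory derived from the timestamp."""
    ts = datetime.fromisoformat(json.loads(line)["timestamp"])
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"


def missing_counts(received_lines, server, sent):
    """Return the counts the sender issued that the receiver never saw."""
    seen = set()
    for line in received_lines:
        record = json.loads(line)
        if record["server"] == server:
            seen.add(record["seq"])
    return sorted(set(range(1, sent + 1)) - seen)
```

In use, the sender side keeps `sender.sent`, the receiver side collects the lines that actually arrived, and `missing_counts` lists the gaps. Because the count is per server in this sketch, each server's stream is reconciled independently.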