American Family Hadoop Journey, UW EBC SIG Meeting, April 2015
1. American Family Hadoop Journey
Case Study Discussion
UW E-Business Consortium
Business Intelligence Special Interest Group
April 2015
2. Objective
Give you a firsthand perspective on
• Why Hadoop?
• How we approached it
• What has worked and what hasn’t
3. Agenda
• Background & Context
• Organizational Frameworks
• Architectural Considerations
4. Background & context
Hadoop is a team sport
Our adoption effort has at times included 30+ staff members:
• Infrastructure
• Software developers
• Architects
• Business experts
• Consultants
• My Role
– For American Family:
• Enterprise Information Architecture
– For Hadoop Team:
• Visionary direction & strategies for experimentation & project work
5. Existing landscape
American Family has a long history with BI
• Mainframe files, function-specific DW, EDW
• Leading BI tools for various roles
• Function-specific analysis
• Standard reporting
• Ad hoc query & reporting
• Statistical analysis & modeling
6. Why Hadoop
• New business initiatives & applications
• Growing demand for flexible analysis platforms
• Increasing interest in analyzing “unstructured” data
8. Why Hadoop is different
• Data storage flexibility
• Programming language flexibility
– Java, Python, R, SQL, APIs
• Processing flexibility
– Batch, interactive, in-memory
– Flexible workload queues
• Storage-dense to compute-dense spectrum
9. Expected benefits / challenges
Benefits
• Increased access to data otherwise unavailable
• Greater collaboration between IT & business technical experts
• Greater capacity and processing power at lower cost than MPP data warehouses
Challenges
• Rapid pace of technical change
• Limited availability of skilled staff
• Less optimized for performance
• Challenging administration
• Trial & error
10. A Journey Needs a Map
• Destination
– Data lake framework
– Technical architecture & principles
• Path & Traveling Companions
– Adoption framework
– Business & technology tracks
– Cross-functional team
– Organizational “architecture” & principles
• Mile Markers
– Use case categories
– Roadmap
11. Destination: Data Lake
Conceptual:
Enable the business with a rich and flexible environment able to store and analyze all of the data they are interested in using.
Technical:
The data lake is a platform capable of storing and processing the largest and most varied datasets that can be useful for the enterprise. The data lake supports the following capabilities:
– Capture and store raw data at scale for a low cost
– Store many types of data in the same repository
– Perform transformations on the data to support specific analysis or operational needs
– Define the structure through which the data should be interpreted at the time it is used
– Perform many types of data processing rather than just SQL
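The capability of defining structure at the time the data is used is commonly called schema-on-read. A minimal sketch in plain Python follows. The raw records, field names, and both schemas are hypothetical and purely illustrative; they are not American Family data or tooling.

```python
import csv
import io

# Hypothetical raw data, stored exactly as captured (no schema enforced on write).
RAW_LINES = """2015-01-03,C1001,1250.50,auto
2015-01-04,C1002,310.00,home
2015-01-05,C1001,89.99,auto
"""

# Two illustrative schemas applied to the same raw text at read time.
FINANCE_SCHEMA = [("date", str), ("customer_id", str), ("amount", float), ("line", str)]
COUNT_SCHEMA = [("date", str), ("customer_id", str)]  # reads only the leading fields


def read_with_schema(raw_text, schema):
    """Interpret raw delimited text using the schema supplied by the reader."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter sequence, so a narrower schema ignores extra fields.
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


finance_rows = read_with_schema(RAW_LINES, FINANCE_SCHEMA)
total_paid = sum(row["amount"] for row in finance_rows)

count_rows = read_with_schema(RAW_LINES, COUNT_SCHEMA)
distinct_customers = {row["customer_id"] for row in count_rows}

print(round(total_paid, 2))
print(sorted(distinct_customers))
```

The same stored text serves two different analyses without being reloaded or restructured, which is the contrast with a warehouse that fixes the schema when data is written.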
14. Path: Adoption framework
http://www.gartner.com/it/content/2604400/2604421/december_12_big_data_road_map_ssicular.pdf?userId=61955890
15. Path: Dual tracks
• Parallel business and technology tracks
• Technology Track
– Technology team explores distribution options
– Realistic use cases drive proofs of concept
– Vendors onsite for 4-6 week experiments
• Business Track
– Focus on business extension use cases first
– Early adopter team receives formal training and completes hands-on experimentation
• Principles
– Business requirements drive expansion
– Market “buzz” and business vision drive experimentation
16. Organizational architecture
[Diagram: organizational structure, with the following labels]
• Core Technology Team
• Manager
• Technical SMEs
• Core Business Research Team
• Business SMEs
• Steering Committee
• Hadoop User Group
Many informal communication channels
• Early adopters
• Technically savvy
• Agile
• 1st year: dedicated matrix
• 2nd year: integrated
17. Mile Markers: Use case categories
Operational Efficiency
• Expense reduction
• Cycle-time reduction
• Quality improvement
Information Advantage
• Knowledge
• Insight
• Prediction
Use case categories
• Long-term data retention
– Moving data that must be retained in an accessible format to a lower-cost platform
• ETL Offload (Cleanse, Conform, Integrate)
– Completing repeated data prep (ETL) steps on a lower-cost, or more highly parallel, platform
• EDW Optimization
– Balancing cost and performance of data storage and query execution by creating a logical data warehouse spanning SMP, MPP, and Hadoop
• Staging for Data Exploration
– Quickly making data accessible for profiling, visual exploration, etc. with low IT investment
• Full (360) View
– Connecting data of all types to create a complete perspective related to customers, products, processes, etc.
• Data Science
– Exploring and processing data using advanced statistical and text processing algorithms to “automate” insight discovery
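The three steps named for the ETL Offload category on this slide (cleanse, conform, integrate) can be illustrated with a minimal plain-Python sketch. The record layouts, field names, and state mapping are hypothetical. On a Hadoop cluster the same steps would run as distributed jobs rather than in a single process.

```python
# Hypothetical source extracts; field names and values are illustrative only.
policy_rows = [
    {"cust": " c1001 ", "state": "wi", "premium": "1200"},
    {"cust": "C1002", "state": "Wisconsin", "premium": ""},  # missing premium
    {"cust": "C1003", "state": "MN", "premium": "950"},
]
claim_rows = [
    {"customer_id": "C1001", "paid": 300.0},
    {"customer_id": "C1003", "paid": 125.0},
    {"customer_id": "C1003", "paid": 75.0},
]

STATE_CODES = {"wisconsin": "WI", "minnesota": "MN"}  # assumed reference mapping


def cleanse(rows):
    """Drop records that fail a basic quality rule (here: premium must be present)."""
    return [r for r in rows if r["premium"].strip()]


def conform(rows):
    """Standardize names, types, and codes so sources share one representation."""
    out = []
    for r in rows:
        state = r["state"].strip()
        out.append({
            "customer_id": r["cust"].strip().upper(),
            "state": STATE_CODES.get(state.lower(), state.upper()),
            "premium": float(r["premium"]),
        })
    return out


def integrate(policies, claims):
    """Join conformed policies with per-customer claim totals on the shared key."""
    paid = {}
    for c in claims:
        paid[c["customer_id"]] = paid.get(c["customer_id"], 0.0) + c["paid"]
    return [dict(p, claims_paid=paid.get(p["customer_id"], 0.0)) for p in policies]


result = integrate(conform(cleanse(policy_rows)), claim_rows)
for row in result:
    print(row)
```

Each step is a pure function over a list of records, which is the shape that maps most directly onto a parallel platform: the cleanse and conform steps work row by row, and only the integrate step needs data grouped by key.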
18. Mile Markers: Roadmap
The roadmap is an intersection of
• Time (across the top)
• Category (down the left)
Each cell contains the kinds of work that need to be demonstrated for a category in that time period.
Milestones at the top indicate when work across all of the categories aligns to produce an observable capability.
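One way to picture this grid is as a table keyed by category and time period. The sketch below takes one reading of the milestone rule: a milestone is reached once every category has demonstrated its work for that period. The period labels, the two categories shown, and the work items are hypothetical placeholders, not the actual roadmap.

```python
# Hypothetical periods (across the top) and categories (down the left).
PERIODS = ["Period 1", "Period 2"]
CATEGORIES = ["ETL Offload", "Data Science"]

# Each cell lists the kinds of work to demonstrate and whether each is done.
# All entries are placeholders for illustration.
roadmap = {
    ("ETL Offload", "Period 1"): {"work item A": True},
    ("Data Science", "Period 1"): {"work item B": True},
    ("ETL Offload", "Period 2"): {"work item C": True},
    ("Data Science", "Period 2"): {"work item D": False},
}


def milestone_reached(period):
    """Return True when all work in every category for the period is done."""
    return all(
        all(roadmap[(category, period)].values())
        for category in CATEGORIES
    )


for period in PERIODS:
    print(period, milestone_reached(period))
```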
19. Architecture choices to consider
• Analysis-specific clusters vs. the “data lake” concept
• Open source, supported distribution, or proprietary platforms
• Native tools vs. licensed accelerators
• SQL, NoSQL, search
• Metadata
• Workload balancing strategy
• Backup & disaster recovery approaches
20. Lessons Learned
• A cross-functional team is critical
• Work through data governance processes early
• Understand the state of the art with respect to handling sensitive data
• Engage “business programmers” early
• Expect BI tools to lag in integrating well with Hadoop