© 2012 Datameer, Inc. All rights reserved.
Hadoop as a Data Hub:
A Sears Case Study
About our Speaker
Phil Shelley

Dr. Shelley is CTO at Sears Holdings
Corporation (SHC), leading IT Operations
and focusing on the modernization of IT
across the company.

Phil is also CEO of Metascale, a subsidiary
of Sears Holdings. Metascale is an IT
managed services company that makes Big
Data easy by designing, delivering and
operating Hadoop-based solutions for
analytics, mainframe migration and
massive-scale processing, integrated into
the customers’ enterprise.
About our Speaker
Stefan Groschupf

Stefan Groschupf is the co-founder and CEO of
Datameer and one of the original contributors to
Nutch, the open source predecessor of Hadoop.

Prior to Datameer, Stefan was the co-founder
and CEO of Scale Unlimited, which implemented
custom Hadoop analytic solutions for HP, Sun,
Deutsche Telekom, Nokia and others. Earlier,
Stefan was CEO of 101Tec, a supplier of
Hadoop and Nutch-based search and text
classification software to industry-leading
companies such as Apple, DHL and EMI Music.
Stefan has also served as CTO at multiple
companies, including Sproose, a social search
engine company.
Hadoop as a Data Hub
a new approach to data management
Dr. Phil Shelley
CTO Sears Holdings
CEO MetaScale
The Challenge
• Data Volume / Retention
• Batch Window Limits
• Escalating IT Costs
• Scalability
• Ever Evolving Business
• ETL Complexity / Costs
• Data Latency / Redundancy
• Tight IT Budgets
Challenges & Trends
Constant pressure to lower costs, deliver faster, migrate to real time
and answer more difficult questions…
• Batch → Real-Time
• Proprietary → Open Source
• Capital → Cloud Expense
• Heavy Iron → Commodity
• Linear → Parallel Processing
• Copy and Use → Source Once & Re-Use
• Costs → Down
• Power → Up
What is a Data Hub
A single, consolidated, fully
populated data archive that
gives unfettered user access to
analyze and report on data, with
appropriate security, as soon as
the data is created by the
transactional or other source
system
Why a Data Hub
• Most data latency is removed
• Users and analysts are put in a self-service mode
• The concept of a “data cube” is unnecessary
• Analysis at the lowest level – No need to run at the segment level
• Any question can be asked
• Business users and analysts have unrestricted ability to explore
• Correlation of any data set is immediately possible
• Significant reduction in reporting and analysis times
– Time to source the data
– Time for users to gain access to the data
• Reduction in IT labor ….
– Source Once – Use Many Times
The Traditional Approach
• Data is copied from source systems via ETL
• Sub-sets of data are captured
– Too expensive to keep all detail
– Takes too long to ETL all data fields from sources
• Each use of data generates more unique ETL jobs
• Data is segmented to reduce query times
• Cubes or views are generated to improve analysis speed
• Disparate data silos require ETL before users have access
• Data warehouse costs and performance limitations force
archiving and data truncation
• Tends to lead to different versions of “truth”
• Time lag or latency from data generation to use
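The point about captured sub-sets and cubes can be shown with a minimal Python sketch. It is an illustration only, not taken from the Sears or Datameer systems: the records, field names and both helper functions are hypothetical. A cube built for one question drops the fields it does not aggregate on, so a later, unplanned question can only be answered from the full detail.

```python
from collections import defaultdict

# Hypothetical transaction-level records; field names are illustrative only.
transactions = [
    {"store": "A", "category": "tools", "hour": 9, "amount": 40.0},
    {"store": "A", "category": "tools", "hour": 18, "amount": 60.0},
    {"store": "B", "category": "apparel", "hour": 9, "amount": 25.0},
    {"store": "B", "category": "tools", "hour": 18, "amount": 75.0},
]


def build_cube(rows):
    """Pre-aggregate sales by (store, category), dropping all other detail."""
    cube = defaultdict(float)
    for row in rows:
        cube[(row["store"], row["category"])] += row["amount"]
    return dict(cube)


def sales_by_hour(rows):
    """An unplanned question: total sales by hour of day, from full detail."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["hour"]] += row["amount"]
    return dict(totals)


cube = build_cube(transactions)
# The cube answers the question it was designed for.
print(cube[("A", "tools")])
# The hour field is not in the cube, so this needs the detailed records.
print(sales_by_hour(transactions))
```

In this sketch the cube keys carry only store and category, so answering the hour-of-day question from the cube would require rebuilding it from the source.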
Benefits - Hadoop as a Data Hub
• All data is available
– All history
– All detail
• No need to filter, segment or cube before use
• Data can be consumed almost immediately
• No need to silo into different databases to
accommodate performance limitations
• Users do not require IT to ETL data before use
• Security is applied via Datameer profiles
• User self-service is a reality
Prerequisites
• An Enterprise data architecture that has a Data
Hub as a foundation
• Data sourcing must be controlled
• Metadata must be created for data sources
• A leader with the vision and capability to drive
• Willing business users to pilot and coach others
• A sustained strategy for Enterprise Data
Architecture and governance
• A carefully designed Hadoop data layer
architecture
Key Concepts
• A Data Hub is now reality
• Drives lower costs and reduces delays
• Time to value for data is reduced
• Business users and analysts are empowered
• The most important:
– Source Once – Re-use Many Times
– Source everything
– Retain everything
Key Learning
o ETL complexity is no longer needed – DATA HUB
– Source Once – Re-Use many times
– ETL is transformed to ELTTTTTT with lower data latency
– Consume data in-place with Datameer
o ETL-induced data latency is largely eliminated
– Analysis is routinely possible within minutes of data creation
o Long-running overnight workload on Legacy Systems
– Can be eliminated and executed at any time
– Run times are a fraction of the original clock-time
o Batch processing on mainframes or other conventional batch
– Moved to Hadoop
– Run 10, 50, even 100 times faster.
o Intelligent Archive
– Put your archives/tape data on Hadoop and make it Intelligent
– Archive with the ability to run analytics or join it with other data
o Modernize Legacy
– Mainframe MIPS reduction has very attractive ROI
– Move Data Warehouse workload – Reduce Cost – Go Faster
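The "Source Once – Re-Use many times" and "ELTTTTTT" ideas can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the deck's actual implementation: the raw feed, the field names and the functions `load_once`, `revenue_by_store` and `units_by_sku` are all hypothetical, and an in-memory list stands in for data landed on Hadoop. The raw records are loaded once without filtering, and each analysis applies its own transform to the same data in place.

```python
import json
from collections import defaultdict

# Hypothetical raw feed, kept exactly as the source system emitted it.
RAW_LINES = [
    '{"sku": "123", "qty": 2, "price": 9.99, "store": "A"}',
    '{"sku": "456", "qty": 1, "price": 25.00, "store": "B"}',
    '{"sku": "123", "qty": 3, "price": 9.99, "store": "B"}',
]


def load_once(lines):
    """Extract and load: parse every record, with no filtering or reshaping."""
    return [json.loads(line) for line in lines]


def revenue_by_store(records):
    """One transform over the shared data: revenue per store."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["store"]] += rec["qty"] * rec["price"]
    return dict(totals)


def units_by_sku(records):
    """A second, independent transform over the same loaded data."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["sku"]] += rec["qty"]
    return dict(totals)


hub = load_once(RAW_LINES)      # sourced once
print(revenue_by_store(hub))    # transform 1
print(units_by_sku(hub))        # transform 2, no new extract job needed
```

Adding a third question here means writing one more transform function over `hub`, with no additional extract or load step.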
Sample Reports - Datameer
Questions and Answers
Online Resources
• Try Datameer: www.datameer.com
• Visit Metascale: www.metascale.com
• Follow us on Twitter @datameer & @BigDataMadeEasy
