Hadoop is Happening
May 1, 2014
Syncsort Confidential and Proprietary - do not copy or distribute
Published in: Software, Technology
Hadoop is Happening

  1. Hadoop is Happening (May 1, 2014)
  2. Agenda
     – Hadoop Evolution
     – Use Cases
     – The Hadoop Ecosystem, from open source to vendor solutions
     – Tooling, implementation and skillset challenges
     – Real-World Case Studies
     – Future of Hadoop
     – Q&A
  3. Our Guest – Chida from OpenOsmium
     – 20+ years of Enterprise Application Development Experience
     – Focused on Big Data & Cloud
     – Founder of Big Data Solution Provider – OpenOsmium
     – DC Tech Community Organizer of Meetups – Google Developer Group, Tech Breakfast, NoVA Hadoop User Group
     – Open Source, Big Data and Cloud Advocate
     – 703-568-7426, chida@openosmium.com
  4. EVOLUTION OF HADOOP
  5. Evolution of Hadoop – Data Volumes are Growing
  6. Evolution of Hadoop – Key Events
     – Timeline markers: 2000, 2004, 2008, 2010, 2012, 2013, Next?
     – Search Engine Problem @ Google
     – 3 White Papers: GFS, MapReduce, BigTable
     – MapReduce: Simplified Data Processing on Large Clusters
     – Yahoo!: HDFS, MapReduce, HBase
     – Cloudera, MapR, Hortonworks
     – Hadoop 2.0
  7. Why Hadoop As a Data Management Platform?
     The Reliability of a Mainframe, The Massive Performance at Scale of an MPP appliance, The Storage Capacity of a SAN, All at a Disruptively Low Price Point
  8. The Economics of Data
     Cost of managing 1TB of data:
     – Mainframe: $20,000 – $100,000
     – EDW: $15,000 – $80,000
     – Hadoop: $250 – $2,000
     But there’s more…
     – Scalability, Performance, Reliability, Agility, Skills Supply
  9. Hadoop - The Big Picture
     – Unified computation provided by the MapReduce distributed computing framework
     – Unified storage provided by a distributed file system called HDFS
     – Commodity hardware: the hardware contains a bunch of disks and cores
     – Diagram labels: Physical, Logical, Storage, Computation
  10. MapReduce – Football Stadium Analogy
  11. Yesterday’s Architecture
  12. Tomorrow’s Data Architecture
  13. HADOOP USE CASES
  14. Hadoop Use Cases
      – Data Lake
      – Offload Mainframe Data & Batch Workloads
      – Machine Data
      – Cyber Security
      – Fraud Detection
      – Offload ELT from Data Warehouse
      – Clickstream / Weblogs, EMR
      – Social Media Data
      – Geo Spatial
      – Analyzing Video and Audio Analytics
      – Real-Time Processing
      – Predictive Analytics
      – Unstructured Data
      – Active Archive
      – Multi-media
      – Leverage “Dark Data”
      – Sentiment Analysis
      – Enterprise Data Hub
  15. Hadoop Use Cases: A Roadmap for Hadoop Success
      – Offload batch & ELT workloads from data warehouse and mainframe systems into Hadoop
      – Develop an active archive, shed light on dark data
      – Build your Enterprise Data Hub (Data Lake!)
      – Leverage new data sources
      – Extend BI with data discovery & exploration
      – Deliver next-generation analytics
  16. Sample Use Case: Offload
      – Phase I: Identify
        • Identify data & workloads most suitable for offload
        • Focus on those that will deliver maximum savings & performance
      – Phase II: Offload
        • Access and move virtually any data to Hadoop with one tool
        • Easily replicate existing workloads in Hadoop using a graphical user interface
      – Phase III: Optimize & Secure
        • Deploy and optimize the new environment
        • Manage & secure all your data with business class tools
  17. Phase 2: Deliver ‘Next-generation’ Applications
      Advanced – ‘Next-gen’ – Applications for Hadoop
      – Semi-structured data analytics
        • Clickstream/Weblog, Electronic Medical Records
      – Unstructured data analytics
        • video, audio, documents, text, social
        • Predictive modeling
      – Geospatial analysis
      – Real-Time Processing
  18. Use Cases Across Industries
      Table columns: Vertical, Refine, Explore, Enrich
      – Retail & Web
        • Log Analysis/Site Optimization
        • Loyalty Program Optimization
        • Brand and Sentiment Analysis
        • Market basket analysis
        • Dynamic Pricing
        • Session & Content Optimization
        • Product recommendation
      – Telco
        • Customer profiling
        • Equipment failure prediction
        • Location based advertising
      – Government
        • Threat Identification
        • Person of Interest Discovery
        • Mission work
      – Finance
        • Risk Modeling & Fraud Identification
        • Trade Performance Analytics
        • Surveillance and Fraud Detection
        • Customer Risk Analysis
        • Real-time upsell, cross sales marketing offers
      – Energy
        • Smart Grid: Production Optimization
        • Grid Failure Prevention
        • Smart Meters
        • Individual Power Grid
      – Manufacturing
        • Supply Chain Optimization
        • Customer Churn Analysis
        • Dynamic Delivery
        • Replacement parts
      – Healthcare
        • Electronic Medical Records (EMPI)
        • Clinical decision support
        • Clinical Trials Analysis
        • Insurance Premium Determination
  19. IMPLEMENTATION & SKILLSET CHALLENGES
  20. Overview of Hadoop Challenges
      – Hardware??
      – Skills??
      – Training??
      – Rapid change of Hadoop Ecosystem?
  21. Example 1 - ETL in Hadoop
      – COLLECT (HARD): Flume, Sqoop, FS Shell Put Command
      – PROCESS (HARDER): Sort, Join, Aggregate, Copy, Merge using Pig, HiveQL, Java
      – DISTRIBUTE (HARD): Sqoop, FS Shell Get Command
  22. Example 2 – Mainframe Data Ingestion
      Perception: Just Call the Mainframe Guy…
      Images: http://monkeestv.tripod.com/BatMonkee/
  23. Example 2 – Mainframe Data Ingestion
      Reality: Every Change = Time, Cost
      – Call MF Guy
      – SMS Compression
      – DB Tables, Flat Files
      – Filtering, Reformatting
      – Copy, Sort, Join, Aggregation
      – EBCDIC to ASCII
      – Cobol copybooks
      (The slide shows this list three times.)
      Image: bottletales.com
  24. Big Data Team
      – Senior Linux/Unix Admin, Hadoop Administrators, Infrastructure Engineers
      – Java Developers → Hadoop Developers
      – Object Oriented Developers → Hadoop Developers
      – Data Analysts, Functional Users → Hadoop Analytics Users
      – Project Managers!
      – Chief Data Officer
      – Executive Management
  25. Enterprise Adoption Approach
      – Agile
      – Ideal Use Case for the company
      – Proof-of-concept or Pilot
      – Tech Heavy
      – Aware of Available Options – Many..
      – Work with Solution Architects
      – Infrastructure Analysis
      – Security Options
      – Testing.. Testing..
      – Integrating with current Stack
      – Cost.. Cost..
      – Promises Vs Reality
  26. THE HADOOP ECOSYSTEMS – FROM OPEN SOURCE TO VENDOR TOOLS
  27. Hadoop Distributions
  28. Vendor Landscape
      – Distributions / Platforms
      – Data Integration/ETL
      – Search
      – Document Store
      – Database / Data Warehouse
      – Social
      – Operational
      – XML Database
      – Graphs
  29. REAL-WORLD CASE STUDIES
  30. Understanding Mainframe Data at Major US Bank
      Customer hit a wall after months of manual effort migrating Mainframe data
      – Before: Manual Effort (86-page copybook: ? weeks)
        • Difficult to find data errors. No Mainframe application logic that matches Copybook
        • Large and complex Copybooks
        • Depends on Mainframe team to provide data
        • Very manual-intensive; inadequate documentation
        • Not scalable. Only a few Java + Mainframe experts could do the work
      – After: DMX-h + CDH (86-page copybook: 4 hrs)
        • Easy to validate Copybooks and find data errors
        • Ability to pull data directly from Mainframe without relying on Mainframe team
        • No coding. No scripting. Easier to document, maintain & reuse
        • Enables developers with a broader set of skills to build complex migration jobs
  31. Social Security Administration
      The Challenge:
      – The SSA has an expensive problem with fraudulent claims for benefits, and they need more and better data to prevent and punish that fraud. The Office of the Inspector General for the SSA reports that:
      – “Nationally, in Fiscal Year 2011, there were more than 103,000 allegations of Social Security fraud, with more than 7,000 criminal investigations resulting in 1,374 convictions and more than $410 million in recoveries, fines, restitution, judgments, settlements, and savings.”
      Why Hadoop?
      – Data Processing Time – 30 hrs on the mainframe (MF); the PoC cluster completed in 2 hrs
      – Accuracy – Obituary data gathered over social media is likely more accurate than the current death file
  32. Optimizing the EDW at Large Teradata Customer
      • Offload ELT processing from Teradata into CDH using DMX-h
      • Implement flexible architecture for staging and change data capture
      • Ability to pull data directly from Mainframe
      • No coding. Easier to maintain & reuse
      • Enable developers with a broader set of skills to build complex ETL workflows
      Elapsed Time (m): HiveQL 360 min; DMX-h 15 min
      Development Effort (Weeks): HiveQL 12 man weeks; DMX-h 4 man weeks
      Impact on Loans Application Project:
      • Cut development time to 1/3 (12 to 4 man weeks)
      • Reduced complexity: from 140 HiveQL scripts to 12 DMX-h graphical jobs
      • Eliminated need for Java user defined functions
      • 24x faster!
  33. Log File Processing
  34. Video - Placemeter
      http://vimeo.com/69091237
  35. What to do next
      No one is impartial, but it’s still worth talking to:
      – Vendors
      – Industry Analysts
      – Industry Peers
      – People at Meetups
      – Practitioners like Chida
  36. Why Hadoop As a Data Management Platform?
      The Reliability of a Mainframe, The Massive Performance at Scale of an MPP appliance, The Storage Capacity of a SAN, All at a Disruptively Low Price Point
  37. Big Data – Projects
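
Slides 9 and 10 introduce MapReduce, but the stadium image itself is not in the transcript. The sketch below assumes a common version of the analogy: each section of the stadium counts its own fans in parallel (map), the counts are grouped by team (shuffle), and the groups are summed (reduce). It is plain Python rather than Hadoop code, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def map_section(fans):
    """Map step: one section emits a (team, 1) pair for every fan it holds."""
    return [(team, 1) for team in fans]


def shuffle(pairs):
    """Shuffle step: group the emitted values by key (team)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical stadium: each inner list is one section's fans, labelled by team.
sections = [
    ["home", "home", "away"],
    ["away", "home"],
    ["home", "away", "away", "home"],
]

# On a cluster each section would be mapped on a different node;
# here the sections are processed one after another.
mapped = [pair for section in sections for pair in map_section(section)]
totals = reduce_counts(shuffle(mapped))
print(totals)
```

No section needs to know about any other section until the shuffle, which is what lets the map work spread across commodity machines.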
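
Slide 23 lists "EBCDIC to ASCII" among the mainframe ingestion steps. As a minimal sketch of that single step, the example below decodes a few EBCDIC bytes with Python's built-in `cp037` codec. The code page is an assumption (the slides do not say which EBCDIC variant applies), the sample bytes are made up, and real copybook-driven records also carry binary and packed fields that a plain text decode does not handle.

```python
# Assumption: the data uses EBCDIC code page 037; the deck does not name one.
EBCDIC_CODEC = "cp037"

# Hypothetical 11-byte text field as it might arrive from a mainframe file.
record = bytes([0xC8, 0xC1, 0xC4, 0xD6, 0xD6, 0xD7, 0x40, 0xF2, 0xF0, 0xF1, 0xF4])

# Decode EBCDIC bytes to a Python string, then re-encode as ASCII.
text = record.decode(EBCDIC_CODEC)
ascii_bytes = text.encode("ascii")

print(text)
print(ascii_bytes)
```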