The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios

What's the origin of Big Data? What are the real life usage scenarios where Hadoop has been successfully adopted? How do you get started within your organizations?

Transcript

  • 1. Big Data in Practice: A Pragmatic Approach to Adoption and Value Creation. Raj Nair, Data Practitioner and Consultant
  • 2. Application Services: Enterprise Resource Planning (ERP); eCommerce / eBusiness; Enterprise App Dev and ECM; Legacy Support, Systems Integration and Conversion. Info Management: Business Intelligence and Analytics; Dashboards, Scorecards, Reporting; MDM & Data Modeling; Data Marts, ODS, ETL, Data Mining. IT Infrastructure: IT Professional Services; Network Administration & Support; DB Admin & Maintenance; Hosting and Application Support. Process & Governance: SDLC – Agile, TDD, TFD, Iterative; Requirements Analysis, PMP, Change Management and Automated QA; Training & Knowledge Transition and Technical Documentation.
  • 3. Content NOT FOR DISTRIBUTION: Property of Raj Nair. Object Technology Solutions Inc. (OTSI) is a leading Information Technology (IT) services and solutions company founded in 1999, serving a clientele of Fortune 500 companies with IT solutions in the areas of SDLC, Information Management, Business Intelligence, ERP, eCommerce (B2B, B2C), Mobile, Enterprise Solutions, Middleware and Infrastructure. Technology expertise and experience: SAP (Business Objects, ERP); Microsoft (SharePoint, .Net, SQL Server, Project Server); IBM (WebSphere, Cognos, Rational Suite); HP (testing tools, PPM); data (Oracle, DB2, SQL Server, Teradata); OS (Windows, Unix – AIX, Linux, HP-UX); open source; Java. Certified diversity supplier in KS, MO and IL.
  • 4. Agenda: 1) Big Data – The Original Use Case; 2) Mainstream Big Data; 3) Real World Use Cases and Applications; 4) Practical Adoption: Opportunity Identification; 5) Big Data 2.0 – What's on the Horizon?; 6) Conclusion
  • 5. An Open Source Engine. The year was 2002... Doug Cutting and Mike Cafarella
  • 6. Already Somebody's Biz Problem • Problem of capacity & scale
  • 7. The Perfect Storm: MapReduce, Google File System, BigTable
  • 8. MapReduce + Google File System = Hadoop
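    To make the MapReduce side of that equation concrete, here is a minimal word-count sketch against the standard Hadoop Java MapReduce API; the input and output paths are command-line arguments, and the class names and tokenization rule are illustrative rather than anything taken from the deck.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

          // Map phase: emit (word, 1) for every token in the input split.
          public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
              for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                  word.set(token);
                  context.write(word, ONE);
                }
              }
            }
          }

          // Reduce phase: sum the counts emitted for each word.
          public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) sum += v.get();
              context.write(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. a directory of raw text in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }

    The point of the pattern is that the framework moves this code to the nodes holding the data blocks, rather than moving the data to the code.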
  • 9. Agenda: 1) Big Data – The Original Use Case; 2) Mainstream Big Data; 3) Real World Use Cases and Applications; 4) Practical Adoption: Opportunity Identification; 5) Big Data 2.0 – What's on the Horizon?; 6) Conclusion
  • 10. Yes, but... we are not Google. Sears: dynamic pricing. AT&T: quantifying customer impact from failed cell towers. Nokia: a holistic view of how users interact with apps across the world. Zions Bancorp: analyzing 130 data sources for fraud. Cerner: detecting health risks.
  • 11. Everyday Big Data: reaching scale-up limits on your server; represents tools, technologies, and frameworks for storage and processing at scale; represents opportunity.
  • 14. Big Data 1.0 – The Hadoop Ecosystem: a software library and framework for large-scale distributed processing, with the ability to scale to thousands of computers.
  • 15. Design Principles: large data sets; classic Hadoop MapReduce is batch processing; moving computation is cheaper than moving data; hardware failure is expected, so rely on redundancy.
  • 16. This, not "That": a software framework (storage/compute), not a database management system or an appliance; batch processing, not real-time or interactive; write once, read many, not delete and update ("ACID"); unassuming of data formats, not imposing any schemas; open source, not lock-in; made for commodity servers with local disks, not meant to be run in virtualized environments.
  • 17. What is this you call data? Unlearn the current notion of "data". Native data source.
  • 18. HDFS: storage and archival; MapReduce: programming library; Crunch: data pipeline processing; HBase: real-time access (low latency); Pig: M/R abstraction; Hive: data warehouse; Sqoop: data transfer; Flume: data streaming (high latency). Categories: data processing, workload management, data movement.
  • 19. Purpose and use: HDFS – distributed storage – raw data storage and archival; Flume – data movement – continuous streaming into HDFS; Sqoop – data movement – data transfer from RDBMS to HDFS/HBase; HBase – workload mgmt – near real-time read/write access to large data sets; Hive – workload mgmt – analytical queries, data warehouse; MapReduce – data processing – low-level custom code for data processing; Crunch – data processing (Java) – coding M/R pipelines, aggregations; Pig – data processing – scripting language, similar to Crunch.
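    As a small illustration of the first row (HDFS for raw storage and archival), the Hadoop FileSystem Java API can land a local export file in HDFS untouched; the namenode URI, paths, and file names below are assumptions for the sketch, not details from the slides.

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ArchiveToHdfs {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical namenode address; in practice this usually comes from core-site.xml.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            Path local = new Path("/var/exports/orders_2014-05-01.csv");   // assumed source file
            Path archive = new Path("/data/raw/orders/2014/05/01/");       // raw zone, kept in native format

            fs.mkdirs(archive);                    // no-op if the directory already exists
            fs.copyFromLocalFile(local, archive);  // copy the file as-is, no schema imposed
            System.out.println("Archived to " + archive);
            fs.close();
          }
        }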
  • 20. A Powerful Paradigm: Hadoop keeps the storage layer, query engine, processing engine, and metadata as separate layers, so multiple query engines can work on data in its native format. Oracle, SQL Server, and DB2 tightly integrate storage and query into proprietary stacks that cannot free your data.
  • 21. Agenda: 1) Big Data – The Original Use Case; 2) Mainstream Big Data; 3) Real World Use Cases and Applications; 4) Practical Adoption: Opportunity Identification; 5) Big Data 2.0 – What's on the Horizon?; 6) Conclusion
  • 22. Opportunity... transform data processing; data exploration; information enrichment; data archival.
  • 23. Data Processing Pipeline: several sources, varying frequencies, varying formats; quality checks, validations, scrubbing; transformations/rules; prune app data sources; discard/archive.
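    Since the deck lists Crunch as the Java tool for coding such pipelines, here is a rough sketch of a quality-check step written with Apache Crunch: read a raw feed, keep only records that pass a basic validation, and write the scrubbed output. The five-field comma-delimited format, the paths, and the rule itself are invented for illustration.

        import org.apache.crunch.FilterFn;
        import org.apache.crunch.PCollection;
        import org.apache.crunch.Pipeline;
        import org.apache.crunch.impl.mr.MRPipeline;

        public class ScrubFeed {

          // Keep only records that pass a basic quality check (assumed: 5 comma-separated
          // fields with a non-empty ID in the first column). Real rules come from the business.
          static class ValidRecord extends FilterFn<String> {
            @Override
            public boolean accept(String line) {
              String[] fields = line.split(",", -1);
              return fields.length == 5 && !fields[0].trim().isEmpty();
            }
          }

          public static void main(String[] args) {
            Pipeline pipeline = new MRPipeline(ScrubFeed.class);
            PCollection<String> raw = pipeline.readTextFile("/data/raw/feed/");   // assumed landing zone
            PCollection<String> clean = raw.filter(new ValidRecord());
            pipeline.writeTextFile(clean, "/data/clean/feed/");                   // scrubbed output for the warehouse load
            pipeline.done();
          }
        }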
  • 24. (Diagram) Data processing engine, data warehouse, data storage.
  • 25. (Diagram) ETL engine, data warehouse, data storage.
  • 26. (Diagram) ELT: data warehouse, data storage.
  • 27. From Source to Business Value: sourcing, staging, validations, scrubbing, biz rules, mapping, transforms, shoe-horning into a relational fit, loading, archiving/purging, distribution prep, tuning data stores; minutes to hours per step, hours end to end, over only a subset of the data, with reliability concerns. Missed SLAs = biz frustration.
  • 28. From Source to Business Value: significantly more data sources; highly scalable, highly performant data processing; new business value and faster time to value.
  • 29. Data Exploration: a large reservoir of data; descriptive statistics (central tendencies, dispersion); visualization; "surprise me!"
  • 30. Data Exploration (courtesy: Data Science Central, http://www.datasciencecentral.com/profiles/blogs/r-hadoop-data-analytics-heaven)
  • 31. Information Enrichment
  • 32. Information Enrichment
  • 33. Data Archival: recycle policy.
  • 34. Data Archival: storage in native format; redundancy and replication; easily accessible and inexpensive.
  • 35. Agenda: 1) Big Data – The Original Use Case; 2) Mainstream Big Data; 3) Real World Use Cases and Applications; 4) Practical Adoption: Opportunity Identification; 5) Big Data 2.0 – What's on the Horizon?; 6) Conclusion
  • 36. Practical Adoption: Big Data technologies don't solve all problems; leverage existing investments; respect the complexities of existing systems.
  • 37. Proof of Concept: use your own data for realistic results; focus on very specific pain points; know what you are going to measure.
  • 38. Opportunity Identification: the same pipeline – sourcing, staging, validations, scrubbing, biz rules, mapping, shoe-horning into a relational fit, loading, archiving/purging, distribution prep, tuning data stores; minutes to hours per step, hours end to end, a subset of the data, reliability concerns.
  • 39. (Diagram) Data processing engine, data warehouse, data storage.
  • 40. Data processing engine, data warehouse, data storage: keep all your raw data on cheaper hardware (low cost per byte) and reserve the warehouse for high value per byte ($$); offload from the RDBMS; improve scale and performance; leverage existing tools.
  • 41. Hardware on a budget – Master: 12 cores, 32 GB RAM, 2 TB SATA drives (7.2K RPM), $5,000. Workers: 4 nodes, each with 12 cores, 16 GB RAM, 4 TB SATA drives (7.2K RPM), $5,000 each. 4-port 10 Gig switch: $1,500. Grand total < $30,000. Software costs? $0.
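    Adding up the figures on the slide: one master at $5,000, four worker nodes at $5,000 each ($20,000), and a $1,500 switch come to roughly $26,500 in hardware, which is where the "under $30,000" total comes from, with nothing spent on licenses for the open source stack.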
  • 42. Adding NoSQL: data storage, NoSQL, data processing engine, data warehouse; keep all your raw data on cheaper hardware (low cost per byte), with the warehouse holding the high value per byte data ($$).
  • 43. Exploratory BI / Analysis on the data storage layer makes data exploration practically cheaper and faster; use existing visualization tools (Tableau or other); check for integration with R.
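    One common way existing tooling plugs in is through HiveServer2's JDBC driver, which is also how BI tools such as Tableau typically connect; the endpoint, credentials, and the page_views table in this sketch are assumptions, not details from the deck.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class ExploreWithHive {
          public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver

            // Assumed endpoint and database; a BI tool would hold the same connection details.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "analyst", "");
                 Statement stmt = con.createStatement();
                 // Hypothetical table: daily hit counts computed straight off the raw data in HDFS.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT dt, COUNT(*) AS hits FROM page_views GROUP BY dt ORDER BY dt")) {
              while (rs.next()) {
                System.out.println(rs.getString("dt") + "\t" + rs.getLong("hits"));
              }
            }
          }
        }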
  • 44. Data Architecture • the single most important factor • don't miss technology trends, but... it's more about the battle plan.
  • 45. Agenda: 1) Big Data – The Road to Now; 2) Mainstream Big Data; 3) Real World Use Cases and Applications; 4) Practical Adoption: Opportunity Identification; 5) Big Data 2.0 – What's on the Horizon?; 6) Conclusion
  • 46. What about that RDBMS? Too many new data types; extreme demands for loading and query access; dynamic / just-in-time schemas; SQL is great, but why limit it to relational? Still great for transactional workloads.
  • 47. What’s Next? Multi-tenant Hadoop SQL on Hadoop Security In-memory Real Time
  • 48. Multi-tenant Hadoop: HDFS 2 for storage and archival; YARN (Yet Another Resource Negotiator) as an application container with scalable resource management; MapReduce (batch), HBase (online), Hive (interactive), in-memory, and search run side by side; MapReduce becomes "one type of application workload".
  • 49. SQL on Hadoop: Impala (Cloudera) – MPP engine; Tez (Hortonworks) – SQL on Hive; Phoenix (Apache) – SQL on HBase.
  • 50. In-memory and Real Time: Spark – up to 100x faster than M/R; Storm – event processing; Apache Drill – low-latency ad hoc queries, interactive queries at scale.
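    As a taste of the in-memory style of work, here is a minimal Spark job in its Java API: load a log file once, cache it in memory, and answer a couple of ad hoc questions without re-reading disk. The file path, the local[*] master, and the ERROR/timeout filters are illustrative assumptions.

        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.api.java.JavaSparkContext;

        public class LogProbe {
          public static void main(String[] args) {
            // Local master for a quick test; on a cluster this would be supplied by spark-submit.
            SparkConf conf = new SparkConf().setAppName("log-probe").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Load once, cache in memory, then ask several questions against the cached data.
            JavaRDD<String> lines = sc.textFile("/data/raw/app.log").cache();

            long errors = lines.filter(l -> l.contains("ERROR")).count();
            long timeouts = lines.filter(l -> l.contains("ERROR") && l.contains("timeout")).count();

            System.out.println("errors = " + errors + ", timeouts = " + timeouts);
            sc.stop();
          }
        }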
  • 51. Honorable (proprietary) mentions: RDBMS on Hadoop; complete packages; MPP, SMP, and dataflow engines with HortonWorks underneath; managing and analyzing machine-generated data.
  • 52. Agenda: 1) Big Data – The Road to Now; 2) Mainstream Big Data; 3) Real World Use Cases and Applications; 4) Practical Adoption: Opportunity Identification; 5) Big Data 2.0 – What's on the Horizon?; 6) Conclusion
  • 53. Where can I get Hadoop? Distributors, the open source Apache project, and the cloud.
  • 54. Conclusion: the power and paradigm of distributed computing; the "nativity" of data – unlearn old notions; identify and understand your data processing pipeline; POC with a measurable, specific use case; data architecture is key to sustainable scalability; stay informed.