The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks

Arun Murthy will discuss the future of Hadoop and the next steps for the big data world. With the advent of tools like Spark and Flink and the containerization of apps using Docker, there is a lot of momentum in this space. Arun will share his thoughts and ideas on what the future holds.

Bio:
Arun C. Murthy
Arun is an Apache Hadoop PMC member and has been a full-time contributor to the project since its inception in 2006. He is also the lead of the MapReduce project and has focused on building NextGen MapReduce (YARN). Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo!. In essence, he was responsible for running Apache Hadoop’s MapReduce as a service for Yahoo!. He also jointly holds the current world sorting record using Apache Hadoop. Follow Arun on Twitter: @acmurthy.

Published in: Technology

  1. Hadoop – Looking to the Future. Arun C. Murthy, Hortonworks Co-Founder, @acmurthy
  2. Hortonworks Connected Data Platforms that power modern data apps with actionable intelligence from ALL data. Diagram labels: Internet of Anything; HDF for Data in Motion (perishable insights); HDP for Data at Rest (historical insights); Modern Data Apps; Actionable Intelligence. © Hortonworks Inc. 2011 – 2015. All Rights Reserved
  3. Looking Back…
  4. In the beginning… 2006: Hadoop w/ MapReduce. Diagram: MapReduce (largely batch processing) on HDFS (Hadoop Distributed File System), nodes 1 to N. Traditional Hadoop allowed early adopters to deal with data at scale, however…
     • Single-purpose clusters, specific data sets
     • Primarily a batch system using MapReduce
     • Difficult to natively integrate existing applications
     • Limited enterprise capabilities: Operations, Security & Governance
  5. Timeline: 2006, Hadoop w/ MapReduce (MapReduce, largely batch processing, on HDFS); 2009, MAPREDUCE-279; October 23, 2013, Apache Hadoop 2.0 & YARN.
     Common data, multiple applications
     • Support multi-tenant cluster
     • Batch, interactive & real-time use cases can leverage the most appropriate engine
     Architectural Center
     • Consistent security, governance & operations
     • Ecosystem applications run natively in Hadoop
     Diagram: Batch, Interactive and Real-Time workloads on YARN : Data Operating System, on HDFS (Hadoop Distributed File System), nodes 1 to N.
  6. Apache Hive and the Power of YARN. Stinger Initiative: next-generation SQL-based interactive query in Hadoop.
     • Speed: Performance increased 100x for interactive & batch use cases
     • Scale: Queries from GBs, to TBs to PBs
     • SQL: Broadest range of SQL semantics
     Apache Hive Community: 1,672 Jira tickets closed; 145 developers; 44 companies; ~390,000 lines of code added… (2x); 13 months (Hive 10, Hive 12, Hive 13).
     Dramatically faster queries speed time to insight: from thousands of seconds to seconds.
     Diagram: Apache Hive (SQL) on Apache Tez (interactive SQL), alongside legacy MapReduce and other engines & workloads (business analytics, custom apps), on YARN : Data Operating System, on HDFS (Hadoop Distributed File System), nodes 1 to N.
  7. Apache Hive – Interactive SQL in Hadoop. Stinger: next-generation SQL-based interactive query in Hadoop.
     • ORC: IO improvements; efficient processing via complex pushdown
     • Tez: Powerful primitives for the SQL planner
     • VQP: Efficient CPU utilization in the inner loop
     Diagram: Apache Hive (SQL) on Apache Tez (interactive SQL), alongside legacy MapReduce and other engines & workloads (business analytics, custom apps), on YARN : Data Operating System, on HDFS (Hadoop Distributed File System), nodes 1 to N.
  8. Sub-Second SQL with Hive LLAP. Stinger.Next: sub-second SQL in Hadoop via Hive/LLAP.
     • CBO: The “right” plan executed violently…
     • LLAP: Long-lived daemon for low-latency startup, caching & CPU efficiency via JIT
     • Metastore: Extensive stats & scalability
     Diagram: Apache Hive (sub-second SQL) on LLAP and Apache Tez, alongside other engines & workloads (business analytics, custom apps), on YARN : Data Operating System, on HDFS (Hadoop Distributed File System), nodes 1 to N.
  9. Apache Atlas - Data Governance Initiative. New Apache project proposal.
     Requirements:
     1. Hadoop must snap in to the existing frameworks and openly exchange metadata
     2. Hadoop must address governance within its own stack of technologies
     Engineers from a group of companies dedicated to meeting these requirements in the open.
     Diagram labels: Knowledge Store (Models, Type-System, Policy Rules, Taxonomies); Audit Store (Ranger); Tag Based Policies; Data Lifecycle Management (Falcon); Real-time Tag-based Access Control (Ranger); REST API; Services (Search, Lineage, Exchange); Healthcare (HIPAA, HL7); Financial (SOX, Dodd-Frank); Energy (PPDM); Retail (PCI, PII); Other (CWM).
  10. HDFS – Tiered Storage
  11. Looking Ahead…
  12. Data Trends: Internet of Anything. Aggregate any and all IoAT data from sensors, machines, geolocation, clicks, files, social. Mediate secure point-to-point and bi-directional data flows.
  13. Data Trends: It is cheap to create, collect and curate all data
  14. Data Trends: Containerization, Virtualization
  15. Data Trends: Modern Applications Are Data Applications. Easy to Consume & Operate; Secure; Repeatable.
  16. Assemble Modern Data Applications. YARN.NEXT. Assemble: Select Engines & Services; Wire; Secure & Operate. Diagram: Service, Engine, Data, Security and Admin components, each packaged in containers.
  17. Real-Time Cyber Security. It’s not just how quickly or how much data you ingest; it’s about ingesting and enriching data in real time in order to provide actionable intelligence that stops cyber threats…
  18. Ex. Real-Time Cyber Security with Hortonworks. Deliver actionable insights from real-time and historical network threat alerts.
     • Sources: Raw Network Stream; Network Metadata Stream; Data Stores; Syslog; Raw Application Logs; Other Streaming Telemetry
     • Stages: Parse and Format; Enrich (Threat Intelligence Feeds, Enrichment Data); Persist
     • Stores: Real-Time Index (SOLR); Raw Packet Store (HBase); Long-Term Store (Hive)
     • Applications and Analyst Tools (ex. Zeppelin on Spark): Log Mining and Analysis; Network Packet Mining and PCAP Reconstruction; Big Data Exploration, Predictive Modeling
     • Key components: NiFi -> Kafka -> Storm
  19. Thank You. @acmurthy
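
The parse-and-format, enrich and persist stages named on slide 18 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `timestamp src_ip message` line format, the record field names, the `THREAT_FEED` lookup table and the in-memory list used as a store are all hypothetical and do not come from the talk, which names NiFi, Kafka and Storm as the actual components.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical threat-intelligence feed: source IP -> label (illustrative data only).
THREAT_FEED: Dict[str, str] = {"203.0.113.7": "known-scanner"}


def parse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Parse and format: turn 'timestamp src_ip message' lines into records."""
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue  # skip malformed lines instead of failing the stream
        timestamp, src_ip, message = parts
        yield {"timestamp": timestamp, "src_ip": src_ip, "message": message}


def enrich(records: Iterable[Dict[str, str]],
           feed: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Enrich: tag each record with a threat label looked up by source IP."""
    for record in records:
        yield {**record, "threat": feed.get(record["src_ip"], "none")}


def persist(records: Iterable[Dict[str, str]],
            store: List[Dict[str, str]]) -> int:
    """Persist: append records to a store (a plain list stands in for one here)."""
    count = 0
    for record in records:
        store.append(record)
        count += 1
    return count


def run_pipeline(lines: Iterable[str],
                 feed: Dict[str, str] = THREAT_FEED) -> List[Dict[str, str]]:
    """Chain the three stages lazily, one record at a time."""
    store: List[Dict[str, str]] = []
    persist(enrich(parse(lines), feed), store)
    return store
```

Because each stage is a generator, records flow through one at a time rather than being materialized between stages, which loosely mirrors the streaming character of the pipeline on the slide.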