Evolving Hadoop for the Data Society
Why does the world need an Intel Distribution for Apache Hadoop and what's it got to do with OpenStack?

Published in: Technology


  • 1. INTEL CONFIDENTIAL, FOR INTERNAL USE ONLY. Evolving Hadoop for the Data Society: Open Platform for Next-Gen Analytics. vin.sharma, strategy & marketing, open source x open data
  • 2. Hope trumps hype
  • 3. Virtuous cycle of data-driven innovation
      • CLOUD: richer data to analyze
      • CLIENTS: richer user experiences
      • INTELLIGENT SYSTEMS: richer data from devices
      • 2.8 zettabytes of data generated WW in 2012 (1)
      • 40 zettabytes of data will be generated WW in 2020 (1)
      Sources: (1) IDC Digital Universe 2020, (2) IDC
  • 4. Democratize data analysis
      • Enhance scientific understanding, drive innovation, and accelerate medical cures
      • Create new data-driven business models, reduce resource waste, improve organizational processes
      • Increase public safety with smart traffic and improve energy efficiency with smart grids
  • 5. Models and Cases
  • 6. Data-Intensive Discovery (diagram with three layers: Data Value, Data Analysis, Data Management)
      Domains: Life Sciences, Physical Sciences, Social Sciences
      Data sources: Genome Data, EMR, Clinical Trials, Sensor Data, Images, Sim Data, Census Data, Text, A/V, Surveys
      Other labels: Drug Discovery, Treatment Optimization, Hypothesis Formation, Modeling & Prediction, Astronomy, Particle Physics, Public Policy, Trend Analysis
  • 7. Data-Intensive Discovery: Genomics (Intel Distribution)
      Value
      • Enable researchers to discover biomarkers and drug targets by correlating genomic data sets
      • 90% gain in throughput; 6X data compression
      Analytics
      • Provide curated data sets with pre-computed analysis (classification, correlation, biomarkers)
      • Provide APIs for applications to combine and analyze public and private data sets
      Data Management
      • Use Hive and Hadoop for query and search
      • Dynamically partition and scale HBase
      • 10-node cluster / Intel Xeon E5 processors
      • 10GbE network
  • 8. Data-Driven Business (diagram with three layers: Data Value, Data Analysis, Data Management)
      Domains: Telco, Retail, FSI
      Data sources: Content, CDR, IP Traffic, Shop, Product, Customer Behavior, Transactions
      Other labels: Customer Service, Network Optimization, Product Innovation, Market Insight, Business Efficiency, Behavior Modeling, Fraud Analytics, Client Engagement
  • 9. Data-Driven Business: Customer Service
      Value
      • 300 million wireless subscribers
      • Enable subscriber access to billing data
      • 30X gain in performance; lower TCO
      Analytics
      • Provides real-time retrieval of 6 months of data
      • Supports new BI with 15 types of queries
      • Enables targeted ad serving and promotions
      Data Management
      • Use Hadoop/HBase for search and analysis
      • 30 TB/month of billing data
      • 300K reads/second; 800K inserts/second
      • 133-node cluster / Intel Xeon E5 processors
      Diagram labels: CDR, Subscriber Self Service
  • 10. Data-Rich Communities (diagram with three layers: Data Value, Data Analysis, Data Management)
      Domains: Utilities, Police & Security, Government Services
      Data sources: Meter Data, Infrastructure Data, Monitor Data, Behavior, ID, Demographics
      Other labels: Customer Service, Network Optimization, Smart Grids, Safe Streets, Crime Detection, Crime Prevention, Service Agility, Waste & Fraud Analysis, ID Programs
  • 11. Data-Rich Communities: Smart City
      Value
      • Enforce traffic laws and detect license fraud
      • Monitor and predict traffic patterns
      • In a city of 31 million people
      Analytics
      • Detect traffic law violations automatically
      • Detect driver license fraud by data mining
      • Forecast traffic with predictive analytics
      Data Management
      • 30,000 cameras
      • 6 Mb/s stream rate per camera
      • 15 PB of images in active use
      • 2 billion records in HBase
      Diagram labels: Detection, Prevention, Regional, Local
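The data-management figures on the Customer Service and Smart City slides can be sanity-checked with a few lines of arithmetic. The sketch below is not from the deck. It assumes that "Mb/s" means megabits per second, that every camera streams continuously, that units are decimal (1 PB = 10^15 bytes), and that a month has 30 days. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def aggregate_stream_gbps(cameras: int, mbps_per_camera: float) -> float:
    """Combined stream rate in gigabits per second (decimal units)."""
    return cameras * mbps_per_camera / 1_000


def daily_volume_pb(gbps: float) -> float:
    """Petabytes produced per day at a sustained rate of `gbps` gigabits/s."""
    bytes_per_second = gbps * 1e9 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e15


def average_ingest_mb_per_s(tb_per_month: float, days_per_month: int = 30) -> float:
    """Average ingest rate in MB/s for a monthly volume given in terabytes."""
    return tb_per_month * 1e12 / (days_per_month * SECONDS_PER_DAY) / 1e6


if __name__ == "__main__":
    # Smart City slide: 30,000 cameras at 6 Mb/s each (assumed continuous).
    city_gbps = aggregate_stream_gbps(cameras=30_000, mbps_per_camera=6)
    city_pb_per_day = daily_volume_pb(city_gbps)
    print(f"Smart city: {city_gbps:.0f} Gb/s aggregate, {city_pb_per_day:.2f} PB/day")
    print(f"15 PB covers about {15 / city_pb_per_day:.1f} days at that rate")

    # Customer Service slide: 30 TB/month of billing data (assumed 30-day month).
    print(f"Telco billing: {average_ingest_mb_per_s(30):.1f} MB/s on average")
```

Under these assumptions the cameras produce 180 Gb/s in aggregate, or roughly 1.9 PB per day, so 15 PB of images in active use corresponds to about a week of footage at full rate. The telco billing feed averages only about 11.6 MB/s, which suggests the quoted read and insert rates describe peak query load rather than sustained ingest.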
  • 13. 14 Si 28.085 (periodic-table tile for silicon)
  • 14. At the intersection of transformative forces
      • HPC: enabling exascale (10^18) computing on massive data sets
      • Cloud: helping enterprises build open interoperable clouds
      • Open Source: contributing code and fostering ecosystem
  • 15. Intel® Distribution for Apache Hadoop* software
      • Hardware-enhanced performance & security
      • Enables partner innovation in analytics
      • Strengthens Apache Hadoop* ecosystem
      * Other names and brands may be claimed as the property of others.
  • 16. Intel® Distribution for Apache Hadoop* software, version 3.x
      • Intel® Manager for Apache Hadoop software: Deployment, Configuration, Monitoring, Alerts, and Security
      • HDFS 2.0.3: Hadoop Distributed File System
      • YARN (MRv2): Distributed Processing Framework
      • HBase 0.96.1: Columnar Store
      • ZooKeeper 3.4.5: Coordination
      • Flume 1.3.0: Log Collector
      • Sqoop 1.4.1: Data Exchange
      • Pig 0.9.2: Scripting
      • Hive 0.10.0: SQL Query
      • Oozie 3.3.0: Workflow
      • Mahout 0.7: Machine Learning
      • HCatalog: Metadata
      • Connectors: Ingest, Analysis, Visual
      Legend: Intel proprietary; Intel enhancements contributed to open source; open source components included without change
      All external names and brands are claimed as the property of others.
  • 17. Intel® Distribution for Apache Hadoop* software, version 2.3
      Feature groups: Hardware-enhanced Security, Optimized Performance, Simplified Management
      • File-based encryption in HDFS
      • Up to 20x faster decryption with AES-NI*
      • Role-based access control for Hadoop services
      • Up to 8.5X faster Hive queries using HBase co-processor
      • Adaptive data replication in HDFS and HBase
      • Optimized for SSD with Cache Acceleration Software
      • Integrated text search with Lucene
      • Simplified deployment & comprehensive monitoring
      • Automated configuration with Intel® Active Tuner
      • Deployment of HBase across multiple datacenters
      • Detailed profiling of Hadoop jobs
      • Simplified design of HBase schemas (+ in 2.4)
      • REST APIs for deployment and management (+ in 2.4)
      *Based on internal testing
  • 18. Intel® Distribution for Apache Hadoop* software, version 3.0
      • Cell-level ACLs in HBase
      • Encryption support in Hive and Pig
      • Secure inter-node communication with SSL
      • Compression and CRC with SSE 4.2
      • Up to 8.5X faster Hive queries using HBase co-processor
      • Adaptive replication in HDFS and HBase
      • Snapshot support in Hadoop
      • SNMP support for monitoring
      • Hadoop 2.0.3 and YARN support
      • Lustre support
      • GlusterFS support
      • HCatalog support
      *Based on internal testing
  • 19. Security & Performance
  • 20. Enterprise data requires defense in depth
      Layers: Firewall, Gateway, Authn, AuthZ, Encryption, Audit & Alerts, Containment
  • 21. Intel Expressway protects Hadoop APIs
      • Enforces consistent security policies across all Hadoop services
      • Serves as a trusted proxy to Hadoop, HBase, and WebHDFS APIs
      • Complies with Common Criteria EAL4+, HSM, FIPS 140-2 certifications
      • Deploys as software, virtual appliance, or hardware appliance
      Diagram labels: Firewall, Authn, RBAC, Encryption, Containment; REST APIs: HCatalog, Stargate, WebHDFS
  • 22. Kerberos authenticates Hadoop services
      • Wizard enables setup of secure cluster with encrypted key exchange
      • Manager generates principal and keytab for Hadoop services
      • Manager enables batch upload of keytab files
      Diagram labels: Firewall, Authentication, Encryption, Containment, APIs, KDC, Intel Manager
      Diagram steps (numbered 1 to 5 on the slide): request ticket, send service ticket, request service, send response, validate ticket
  • 23. Manager simplifies role-based access control
      • File, table, and service-level controls
      • Intel Manager pushes ACLs to each node
      Diagram labels: Firewall, AuthZ
  • 24. Intel Distribution provides HDFS encryption
      • Extends compression codec into crypto codec
      • Provides an abstract API for general use
      Diagram labels: Firewall, RBAC; MapReduce pipeline: RecordReader, Map, Combiner, Partitioner, Local Merge & Sort, Reduce, RecordWriter; HDFS: Decrypt, Encrypt, Derivative Encrypt, Derivative Decrypt
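The idea of extending a compression codec into a crypto codec can be illustrated with a small sketch: if encryption is exposed through the same encode/decode shape as compression, readers and writers can swap one for the other without code changes. Everything below is hypothetical and is not the Intel Distribution API or the HADOOP-9331 design. The class and function names are invented, and the cipher is a toy XOR stream used only because the Python standard library has no AES; it is not secure.

```python
import hashlib
import zlib
from abc import ABC, abstractmethod


class Codec(ABC):
    """Hypothetical byte-transforming codec, the shape a compression codec has."""

    @abstractmethod
    def encode(self, data: bytes) -> bytes: ...

    @abstractmethod
    def decode(self, data: bytes) -> bytes: ...


class ZlibCodec(Codec):
    """Ordinary compression codec backed by zlib."""

    def encode(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


class ToyCryptoCodec(Codec):
    """Same interface, but the transform is a keyed XOR stream.

    NOT secure: a placeholder for a real AES cipher, used for illustration only.
    """

    def __init__(self, key: bytes) -> None:
        self._key = key

    def _keystream(self, length: int) -> bytes:
        # Derive 32-byte blocks from SHA-256(key || counter) until long enough.
        blocks = [
            hashlib.sha256(self._key + counter.to_bytes(8, "big")).digest()
            for counter in range((length + 31) // 32)
        ]
        return b"".join(blocks)[:length]

    def encode(self, data: bytes) -> bytes:
        return bytes(a ^ b for a, b in zip(data, self._keystream(len(data))))

    def decode(self, data: bytes) -> bytes:
        return self.encode(data)  # XOR with the same keystream undoes itself


def write_record(codec: Codec, record: bytes) -> bytes:
    """Writer side: works with any codec it is handed."""
    return codec.encode(record)


def read_record(codec: Codec, stored: bytes) -> bytes:
    """Reader side: reverses whatever the matching codec did on write."""
    return codec.decode(stored)
```

The point of the design is that `write_record` and `read_record` never branch on codec type, which is what would let encryption slot into places that already accept a compression codec.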
  • 25. Intel AES-NI accelerates decryption 20x
      AES Encryption, speed in MB/s (slide label: 6X)
                      64k     4k     1k
        AES-NI        460    457    454
        No AES-NI      87     87     86
      AES Decryption, speed in MB/s (slide label: 20X)
                      64k     4k     1k
        AES-NI       1266   1259   1253
        No AES-NI      64     63     63
      • OpenSSL 1.0.1c optimized to use Intel AES-NI (7 math functions in processor accelerate AES)
      • Intel Distribution crypto framework uses OpenSSL 1.0.1c
      • Patch and design document released to open source (JIRA HADOOP-9331)
      Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to
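The speedup labels on the AES-NI slide can be checked against the throughput figures it charts. The short sketch below only divides the numbers transcribed from the slide; the dictionary layout and the `speedups` name are illustrative, and the result is only as reliable as the transcription of the chart.

```python
# Throughput in MB/s as transcribed from the AES-NI slide, keyed by block size.
THROUGHPUT_MB_S = {
    "encryption": {
        "AES-NI": {"64k": 460, "4k": 457, "1k": 454},
        "No AES-NI": {"64k": 87, "4k": 87, "1k": 86},
    },
    "decryption": {
        "AES-NI": {"64k": 1266, "4k": 1259, "1k": 1253},
        "No AES-NI": {"64k": 64, "4k": 63, "1k": 63},
    },
}


def speedups(operation: str) -> dict[str, float]:
    """Ratio of AES-NI to non-AES-NI throughput for each block size."""
    fast = THROUGHPUT_MB_S[operation]["AES-NI"]
    slow = THROUGHPUT_MB_S[operation]["No AES-NI"]
    return {size: fast[size] / slow[size] for size in fast}


if __name__ == "__main__":
    for operation in THROUGHPUT_MB_S:
        ratios = ", ".join(f"{size}: {r:.1f}x" for size, r in speedups(operation).items())
        print(f"{operation}: {ratios}")
```

With these figures, decryption comes out at roughly 19.8x to 20.0x across block sizes, in line with the 20X label, while encryption comes out at about 5.3x, somewhat below the 6X label on the slide.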
  • 26. Learn more about Intel and Hadoop
      Training and Certification
      • Unique insights that help you tune, secure, and manage your deployment in addition to essential understanding of Apache Hadoop
      • Distilled from years of Intel experience in deploying and optimizing Apache Hadoop and HBase for enterprises
      • Based on Intel expertise in optimizing the full Hadoop stack, from Hive on Hadoop through Java to Linux on x86 hardware
      Case Studies and Resources
  • 28. Savanna: Hadoop on OpenStack. Ilya Elterman, Senior Director Cloud Services
  • 29. Hadoop on OpenStack Use Cases
      • Dev and QA teams: fast cluster provisioning
      • Data scientists/analysts: an API to run analytic jobs, with infrastructure provisioning happening under the hood
      • Administrators: centralized cluster management and monitoring
  • 30. Savanna Key Principles
      The goal is to create a native OpenStack component to provision and operate Hadoop clusters on top of OpenStack. Key characteristics:
      • Open source
      • Native for OpenStack
      • Support for different Hadoop distributions
      • Makes resources dedicated to the IaaS cloud available for Hadoop workloads
  • 31. Savanna Architecture Overview
      Diagram labels: Savanna Python Client, REST API, Cluster Configuration Manager, Horizon, Savanna Pages, Keystone, Auth, DAL, Nova, Glance, Swift, Provisioning Plugin, VM Manager, Image Registry, Hadoop VM (shown several times)
  • 32. Savanna Roadmap
      • Phase 1 (completed, April 13th): basic cluster provisioning with "pre-built" images
      • Phase 2 (in progress, July 15th): pluggable mechanism of integration with vendor tooling and cluster operations support
      • Phase 3 (scoping, 2-3 months): "Analytics as a service", a job execution framework with support for different scripting languages
  • 33. Learn more about Savanna
      • All code and documentation open source
      • Latest version 0.1.2 from 05/13
      • Launchpad home page
      • Code on stackforge, integrated with OpenStack CI/CD
      • Active community
  • 34. Live Demo: Savanna with Intel Distribution at Intel Booth