OOP 2014
Data-Lake talk at OOP 2014

Published in: Technology
Speaker notes
  • Hello. Today I’m going to talk to you about Hortonworks and how we deliver an enterprise-ready Hadoop to enable your modern data architecture.
  • Founded just 2.5 years ago by original Hadoop team members from Yahoo, Hortonworks emerged as the leader in open source Hadoop. We are committed to ensuring Hadoop is an enterprise-viable data platform, ready for your modern data architecture. Our team is probably the largest assembled team of Hadoop experts and active leaders in the community. We not only make sure Hadoop meets all your enterprise requirements, like operations, reliability and security; it also needs to be packaged and tested, and we do this, and it has to work with what you have. Our goals: make Hadoop an enterprise data platform and make the market function; innovate core platform, data and operational services; integrate deeply with the enterprise ecosystem; provide world-class enterprise support; drive 100% open source software development and releases through the core Apache projects; address enterprise needs in community projects; establish Apache Foundation projects as “the standard”; promote an open community versus vendor control and lock-in; enable the Hadoop market to function; make it easy for enterprises to deploy at scale; be the best at enabling deep ecosystem integration; create a pull market with key strategic partners.
  • On the left-hand side you see the traditional sources of data you have. Database or web applications may be growing by 8% year over year, and there is also a lot of innovation happening in this space. Over the last couple of years we have seen more and more pressure on these traditional systems, as new data sources bring a massive growth of new data: smartphones, sensors, the Internet of Things, logs. There is no efficient way to store and analyze this data.
  • Then Hadoop entered the scene. What we’re seeing in most organizations is that they bring Hadoop into the data center not to replace the existing systems but to augment or support them. They use the right tool in the right place for the right type of data. Hadoop really is the landing spot for the new data sources we discussed before. It provides a way to store and process these types of new data in a very cost-effective manner. While it’s very cost effective, it also scales horizontally and linearly, which was a key requirement when it was invented at Yahoo: when you need to index the web, you had better know how to scale, and you had better be able to handle the distributed nature of a cluster.
  • Net-new analytic applications: how to extract value from the new sources. 60–70% of Hadoop installations are of this type.
  • Make Hadoop an enterprise data platform: innovate core platform, data and operational services; integrate deeply with the enterprise ecosystem; provide world-class enterprise support. Drive 100% open source software development and releases through the core Apache projects: address enterprise needs in community projects; establish Apache Foundation projects as “the standard”; promote an open community versus vendor control and lock-in. Enable the Hadoop market to function: make it easy for enterprises to deploy at scale; be the best at enabling deep ecosystem integration; create a pull market with key strategic partners.
  • Transcript

    • 1. Hortonworks: We Do Hadoop. Our mission is to enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop. Emil A. Siemes, esiemes@hortonworks.com, Solution Engineer, January 2014
    • 2. Our Mission: Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop. Our Commitment. Headquarters: Palo Alto, CA; Employees: 300+ and growing. Open Leadership: drive innovation in the open exclusively via the Apache community-driven open source process. Enterprise Rigor: engineer, test and certify Apache Hadoop with the enterprise in mind. Ecosystem Endorsement: focus on deep integration with existing data center technologies and skills, together with trusted partners.
    • 3. A Traditional Approach Under Pressure. Applications: custom applications, business analytics, packaged applications. Data system: RDBMS, EDW, MPP repositories. Sources: existing (CRM, ERP, clickstream, logs) and emerging (sensor, sentiment, geo, unstructured). 2.8 ZB of data in 2012, 85% of it from new data types; 40 ZB by 2020; 15x more machine data by 2020. Source: IDC.
    • 4. Emerging Modern Data Architecture. Custom applications, business analytics and packaged applications sit on top of the data system (RDBMS, EDW, MPP repositories), flanked by dev & data tools (build & test) and operational tools (manage & monitor). Sources remain both existing (CRM, ERP, clickstream, logs) and emerging (sensor, sentiment, geo, unstructured).
    • 5. Drivers of Hadoop Adoption: new business applications from NEW types of data (or existing types kept for longer).
    • 6. Most Common NEW TYPES OF DATA: 1. Sentiment: understand how your customers feel about your brand and products, right now. 2. Clickstream: capture and analyze website visitors’ data trails and optimize your website. 3. Sensor/Machine: discover patterns in data streaming automatically from remote sensors and machines. 4. Geographic: analyze location-based data to manage operations where they occur. 5. Server Logs: research logs to diagnose process failures and prevent security breaches. 6. Unstructured (text, video, pictures, etc.): understand patterns in files across millions of web pages, emails, and documents. Plus: keep existing data longer!
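To make the sentiment use case concrete: the following is a deliberately minimal, hedged sketch of lexicon-based sentiment scoring in plain Python. The word lists and function names are invented for illustration; a production pipeline on Hadoop would use a real NLP library or model over Hive/Pig jobs, not this toy.

```python
# Illustrative only: a toy lexicon-based sentiment scorer.
# The word lists below are made up for the example.
POSITIVE = {"love", "great", "awesome", "good"}
NEGATIVE = {"hate", "crash", "bad", "broken"}

def sentiment_score(text):
    """Return (#positive - #negative) keyword hits for a piece of text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "I love this app, great maps",
    "App keeps crashing, bad update",
]
scores = [sentiment_score(r) for r in reviews]
```

Even this crude approach shows the shape of the workload: a per-record scoring function mapped over large volumes of text, which is exactly what Hadoop parallelizes well.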
    • 7. Drivers of Hadoop Adoption: new business applications, and an architectural driver, the Modern Data Architecture. Complement your existing data systems: the right workload in the right place.
    • 8. Let’s build a Data Lake… Instructions on: hadoopwrangler.com
    • 9. HDP Data Lake Solution Architecture. Step 1: extract and load source data (clickstream data, sales/transaction data, social data, product data) into the lake via Flume, Sqoop, JMS, REST/HTTP and WebHDFS ingestion. Step 2: model and apply metadata with HCatalog. Step 3: transform, aggregate and materialize with Hive and Pig. Step 4: schedule and orchestrate with Oozie (batch scheduler); Falcon (data pipeline and flow management) manages the data lifecycle across steps 1–4. Use case type 1: materialize and exchange (table and user-defined metadata) with downstream data sources such as an EDW (Teradata) via Sqoop/Hive. Use case type 2: explore and visualize with query/analytics/reporting tools: Hive Server (Tez/Stinger), Tableau/Excel, SAS, Datameer/Platfora/SAP. YARN runs MR2, Tez, HBase, Storm, Elastic Search and other streaming YARN apps (stream processing, real-time search, MPI) over shared compute and storage on HDFS; HBase clients serve OLTP access. Ambari manages the HDP grid; Knox provides perimeter-level security. This opens up Hadoop to many new use cases.
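The four pipeline steps above can be sketched as plain-Python control flow. This is only a sketch under stated assumptions: in a real HDP data lake, step 1 is Flume/Sqoop ingestion, step 2 is an HCatalog schema, step 3 is a Pig or Hive job, and step 4 is Oozie/Falcon orchestration. Every function name below is hypothetical, not a real Hadoop API.

```python
# Plain-Python sketch of the four data-lake pipeline stages.
# All names are illustrative stand-ins for Flume/Sqoop, HCatalog,
# Pig/Hive, and Oozie/Falcon respectively.

def extract_and_load(sources):
    """Step 1: land raw records from every source in one place."""
    return [rec for src in sources for rec in src]

def apply_metadata(records, schema):
    """Step 2: project raw records onto a shared schema (HCatalog's role)."""
    return [{field: rec.get(field) for field in schema} for rec in records]

def transform_aggregate(records, key):
    """Step 3: aggregate, e.g. count events per key (a Pig/Hive job's role)."""
    counts = {}
    for rec in records:
        counts[rec[key]] = counts.get(rec[key], 0) + 1
    return counts

def run_pipeline(sources, schema, key):
    """Step 4: orchestrate the stages in order (Oozie/Falcon's role)."""
    loaded = extract_and_load(sources)
    modeled = apply_metadata(loaded, schema)
    return transform_aggregate(modeled, key)

clickstream = [{"user": "a", "url": "/x"}, {"user": "b", "url": "/x"}]
sales = [{"user": "a", "amount": 10}]
result = run_pipeline([clickstream, sales], schema=["user", "url"], key="user")
```

The design point the slide makes is the same one the sketch shows: ingestion, metadata, transformation and scheduling are separate, composable services rather than one monolithic job.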
    • 10. Hadoop 2: The Introduction of YARN. Store all data in a single place, interact with it in multiple ways. First-generation Hadoop was a single-use system running batch apps only: MapReduce handled both cluster resource management and data processing on top of HDFS (redundant, reliable storage). Hadoop 2 is a multi-use data platform for batch, interactive, online, streaming and more: standard query processing (Hive, Pig), batch (MapReduce), interactive (Tez), online data processing (HBase, Accumulo), real-time stream processing (Storm) and others all run on efficient cluster resource management and shared services (YARN) over redundant, reliable storage (HDFS).
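For readers new to the model YARN generalizes: here is a minimal in-process sketch of MapReduce (word count), the batch paradigm that first-generation Hadoop hard-wired and that Hadoop 2 runs as just one of many YARN applications. This runs on a single machine purely to show the map/shuffle/reduce phases; it is not Hadoop code.

```python
# Minimal single-process sketch of the MapReduce programming model.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(values) for word, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["hadoop yarn", "hadoop hdfs"])))
```

On a cluster, YARN's contribution is that the map and reduce tasks above become containers scheduled alongside entirely different engines (Tez, Storm, HBase) sharing the same nodes and data.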
    • 11. Let’s start simple… A solution unifying all data sources of a mobile app, allowing analytics over all data in one place, in real time and long term. Mobile apps have multiple channels for data: data created on the handset (e.g. geolocation); data created on servers accessed by the mobile app (e.g. app data, logs); data from backend services (e.g. RDBMS); store data (e.g. iTunes Connect, Google Play); social data (Twitter, app reviews, etc.).
    • 12. Why Should We Care? How much revenue did I make? (Not as easy to answer as one might think.) Where are my customers now? Can you fulfill requirements from the business like: “Tell me when our customers are in a coffee shop so we can offer them e.g. Wi-Fi”? What are my customers thinking about my app/brand? Are the ones complaining really using it (correctly)? How can I support marketing activities? How can I evaluate local marketing activities? Does positive/negative sentiment affect my downloads? Will my servers be able to deal with the load in 3 months? …
    • 13. Design Goals: use as much of our own stack as possible; minimize dependencies on stacks beyond Hadoop, while still keeping it useful and complete; make it fit on an 8 GB MacBook/laptop; release early and release often.
    • 14. iiCaptain
    • 15. Types of Data for iiCaptain: geolocation data; store data (iTunes Connect, Google Play, Amazon via AppAnnie); Twitter; RDBMS (Sqoop); logs.
    • 16. iiCaptain’s Data Ocean / Data Lake
    • 17. More Details
    • 18. Analytics
    • 19. SQL: Interactive Query & Apache Hive. Apache Hive is the de facto standard for Hadoop SQL access, used by your current data center partners and built for batch AND interactive query. Key services: platform, operational and data services essential for the enterprise. Skills: leverage your existing SQL skills in development, analytics and operations. Integration: interoperable with existing data center investments. The Stinger Initiative is a broad, community-based effort to deliver the next generation of Apache Hive. Speed: improve Hive query performance by 100x to allow for interactive query times (seconds). Scale: the only SQL interface to Hadoop designed for queries that scale from TB to PB. SQL: support the broadest range of SQL semantics for analytic applications against Hadoop.
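To show the style of analytic query Hive serves without requiring a cluster, here is a hedged stand-in: the same GROUP BY aggregation run through Python's built-in sqlite3. This uses generic SQL only; Hive-specific features (partitioned tables, SerDes, HiveQL extensions) are out of scope, and the table and data are invented for the example.

```python
# Generic analytic SQL of the kind Hive runs over Hadoop-scale data,
# demonstrated on an in-memory SQLite database as a cluster-free stand-in.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clickstream (user TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO clickstream VALUES (?, ?)",
    [("a", "/home"), ("a", "/buy"), ("b", "/home")],
)
# Page popularity: hits per URL, most popular first.
rows = conn.execute(
    "SELECT url, COUNT(*) AS hits FROM clickstream "
    "GROUP BY url ORDER BY hits DESC"
).fetchall()
```

The point of Hive and the Stinger Initiative is that queries shaped like this keep working, at interactive speed, when `clickstream` holds terabytes instead of three rows.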
    • 20. Build Process, Shining With Savanna
    • 21. Roadmap: servlet engine in YARN; Project Savanna; continuous delivery end-to-end; sentiment analysis with Flume/Hive and app reviews; Knox; Falcon; Phoenix.
    • 22. HDP 2.0: Enterprise Hadoop Platform. The Hortonworks Data Platform (HDP) integrates the full range of enterprise-ready services. Operational services: Ambari (cluster management), Falcon (dataset management), Oozie (scheduling). Data services: Flume, Sqoop, NFS and WebHDFS for data movement, load and extract; Hive, HBase, HCatalog and Pig for data access; Knox for security. Core services: MapReduce and Tez for processing, YARN for resource management, HDFS for storage. Enterprise readiness: high availability, disaster recovery, rolling upgrades, security and snapshots. The ONLY 100% open source and most current platform, certified and tested at scale, engineered for deep ecosystem interoperability, and deployable on OS/VM, cloud or appliance.
    • 23. Hortonworks: The Value of “Open” for You. Validate and try: 1. download the Hortonworks Sandbox; 2. learn Hadoop using the technical tutorials; 3. investigate a business case using the step-by-step business case scenarios; 4. validate YOUR business case using your data in the sandbox. Engage: 1. execute a Business Case Discovery Workshop with our architects; 2. build a business case for Hadoop today. Connect with the Hadoop community: we employ a large number of Apache project committers and innovators so that you are represented in the open source community. Avoid vendor lock-in: the Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open, so you are never locked in. The partners you rely on rely on Hortonworks: we work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments. Certified for the enterprise: we engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use. Support from the experts: we provide the highest quality of support for deploying at scale; you are supported by hundreds of years of combined Hadoop experience.
