Hadoop Now, Next and Beyond

Notes

  • Hadoop 1.0 was architected for the large web properties; Hadoop 2.0 represents the next generation of the foundation of big data. Under development for nearly three years, it is a more mature version of Hadoop that has been architected for broader use by the general enterprise. The main focus for this next generation has been the broader enterprise, whose requirements are a little different from those of the typical web properties that first adopted Hadoop. Some of those requirements forced the community to rethink its approach, and our experience running Hadoop at Yahoo! provided much insight into how we could architect things better. Some of the critical features are listed here; go through them. Highlight the workloads and explain how 2.0 is engineered to meet these exacting demands (there is a graphic to help illustrate): we have moved beyond just batch (a minimal YARN sketch follows these notes).
  • Once data is stored in Hadoop, you have two options for analytics: run batch processes (a MapReduce job or Pig) or do interactive querying using Hive. To address Hive's latency, Hortonworks has launched the Stinger Initiative, which aims to make Hive 100x faster through a combination of more intelligent query optimization in Hive, a modern persistence format called ORCFile, and an execution engine called Tez that enables true interactive data processing on Hadoop. What causes latency in Hive? Sub-optimal queries on some join types; checkpointing to HDFS even when not needed; stored data not optimized for reads; non-optimized operations for aggregations, projections, etc.; and high job startup time. Hive 0.11 delivers the first Stinger phase, including ORCFile; Tez integration is under way (see the ORCFile/Tez sketch after these notes).
  • Everybody is adopting Hadoop as a data processing platform because it accepts any kind of data and can process it at almost any scale. But as people adopt Hadoop and throw all this data on it, they start to find other challenges. For example, how do you ensure data is being processed reliably? How do you know you're not keeping data that is too old? If you process data globally, how do you deal with multi-datacenter replication? The challenge is that the tools that exist for Hadoop, including Oozie, DistCp and others, operate at a very low level, so you need expert developers to build and test data processing solutions. This sort of custom development takes a lot of time and money and is error-prone because you work at such a low level. Still, everybody does it this way because there aren't real alternatives; I see a lot of people using custom scripts to delete files when they get too old, an approach with many drawbacks. Hadoop traditionally doesn't provide native tools that solve problems like retention, anonymization, and reprocessing. Falcon solves this by letting developers work at a much higher level of abstraction: it provides native APIs for data processing, retention, replication and more that abstract away low-level tools like schedulers and the mechanical details of replication. With Falcon, developers do more, do it more easily, and avoid common mistakes, which is probably the most important part. Data management on Hadoop is not easy; Falcon was developed by engineers who worked on large-scale data management at Yahoo!, complete with all the battle scars that brings, so a lot of practical lessons learned are baked into its APIs and ready for developers to simply use (see the retention-feed sketch after these notes). Question: what data lifecycle management needs do you have in your environment?
  • A pretty common scenario we see is a primary cluster and a DR cluster. The DR cluster tends to be smaller than the primary, so you don't want it doing data processing and you need to store less data on it overall. In this case Falcon manages the flow of taking staged data, then cleansing, conforming and presenting that data in the primary cluster. For DR purposes you absolutely need the staged data replicated to the backup cluster; however, the backup cluster isn't powerful enough to do the data processing within SLA windows and doesn't have enough storage for all the cleansed and conformed data. So we don't replicate that; we replicate only the staged data and the presented data. That way, if the primary goes down, clients switch to the failover cluster and continue as if nothing had happened. The failover cluster has the staged data, so if the primary was lost the data can be re-imported into a new primary and re-processed. All of this can be done in one Falcon job; doing it by hand is extremely error-prone (see the replication fragment after these notes).
  • Operators can firewall the cluster so that end users have access only to a gateway node. Users see one cluster endpoint that aggregates capabilities for data access, metadata, and job control. Knox provides perimeter security to make Hadoop security setup easier and enables integration with enterprise and cloud identity-management environments. Verification: verify identity tokens, SAML, and propagation of identity. Authentication: establish identity at the gateway and authenticate with LDAP and Active Directory (see the gateway-access sketch after these notes).
  • Solid-state storage and disk-drive evolution: so far, large-form-factor drives seem to be maintaining their economic advantage (4 TB drives now, and 7 TB next year!), while SSDs are becoming ubiquitous and will become part of the architecture. In-memory databases: bring them on; let's port them to YARN! Hadoop complements these technologies and shines with huge data. Atom and ARM processors, GPUs: this is great for Hadoop, but vendors are not yet designing the right machines (bandwidth to disk is the key bottleneck). Software-defined networks: more network functionality for less!
  • So enterprise Hadoop lies at the heart of the next-generation data architecture. Let's outline what's required in and around Hadoop to make it easy for the enterprise to use and consume. At the center, we start with Apache Hadoop for distributed file storage and processing (a la MapReduce). To enable Hadoop within mainstream enterprises, we need to address enterprise concerns such as high availability, disaster recovery, snapshots, and security. On top of this, we need data services that make it easy to move data in and out of the platform, process and transform the data into useful formats, and let people and other systems access the data easily; this is where components like Apache Hive, Pig, HBase, HCatalog, and other tools fit. Making things easy for data workers is important, but so is making the platform easier to operate; components like Apache Ambari, which addresses provisioning, management and monitoring of the cluster, matter here. All of that (core and platform services, data services, and operational services) comes together into a vision of "enterprise Hadoop". Ensuring that the enterprise Hadoop platform can be deployed flexibly across operating systems and virtual environments such as Linux, Windows, and VMware is important. Targeting cloud environments like Amazon Web Services, Microsoft Azure, Rackspace OpenCloud, and OpenStack is increasingly important, as is the ability to provide enterprise Hadoop pre-configured within a hardware appliance like Teradata's Big Analytics Appliance, which helps pull Hadoop into enterprises as well.
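
To make the "single cluster, many workloads" point from the Hadoop 2.0 note concrete: under YARN every framework runs as an application, so mixed workloads are directly visible through the ResourceManager's REST API. A minimal sketch, assuming Python 3 and a placeholder ResourceManager host (port 8088 is the usual default):

```python
import json
import urllib.request

# ResourceManager REST endpoint; host and port are placeholders for your cluster.
RM = "http://resourcemanager.example.com:8088"

def workloads_by_type():
    """Group YARN applications by applicationType (e.g. MAPREDUCE vs. TEZ)."""
    with urllib.request.urlopen(f"{RM}/ws/v1/cluster/apps") as resp:
        data = json.load(resp)
    apps = (data.get("apps") or {}).get("app") or []
    by_type = {}
    for app in apps:
        # Different frameworks share one cluster under YARN; applicationType
        # tells them apart in the same listing.
        by_type.setdefault(app["applicationType"], []).append(app["name"])
    return by_type

if __name__ == "__main__":
    for app_type, names in workloads_by_type().items():
        print(f"{app_type}: {len(names)} application(s)")
```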
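For the Stinger note, a hedged sketch of the two user-visible pieces: storing a table as ORCFile and switching the execution engine to Tez (on a Hive build that ships the Tez integration). It assumes the third-party pyhive client and illustrative host, table, and column names, including a pre-existing clicks_raw source table:

```python
from pyhive import hive  # assumes HiveServer2 and the pyhive package

conn = hive.connect(host="hiveserver2.example.com", port=10000)
cur = conn.cursor()

# ORCFile: columnar storage, so reads touch only the columns they need.
cur.execute("""
    CREATE TABLE IF NOT EXISTS clicks_orc (
        user_id BIGINT,
        url STRING,
        ts TIMESTAMP
    ) STORED AS ORC
""")
cur.execute("INSERT INTO TABLE clicks_orc SELECT user_id, url, ts FROM clicks_raw")

# Tez: avoids the job-startup and HDFS-checkpointing overhead of chained
# MapReduce jobs (only on a Hive build with Tez integration).
cur.execute("SET hive.execution.engine=tez")
cur.execute("SELECT url, COUNT(*) FROM clicks_orc GROUP BY url")
print(cur.fetchall())
```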
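The Falcon note contrasts declarative retention with hand-rolled cleanup scripts. A sketch of what that can look like: a feed entity submitted over Falcon's REST API, with retention expressed as policy. The XML is only roughly in the shape Falcon's feed schema uses, and the host, port, cluster name, and paths are assumptions:

```python
import urllib.request

# Falcon server endpoint; host and port are placeholders.
FALCON = "http://falcon.example.com:15000"

# A feed entity: Falcon deletes instances older than 90 days for us,
# instead of a hand-rolled cron script scanning HDFS modification times.
FEED_XML = """<?xml version="1.0" encoding="UTF-8"?>
<feed name="clicks-staged" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primary-cluster">
      <validity start="2013-06-01T00:00Z" end="2099-01-01T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/staged/${YEAR}-${MONTH}-${DAY}"/>
  </locations>
  <ACL owner="etl" group="analytics" permission="0755"/>
  <schema location="none" provider="none"/>
</feed>
"""

# Real deployments also authenticate, e.g. with a user.name parameter or SPNEGO.
req = urllib.request.Request(
    f"{FALCON}/api/entities/submit/feed",
    data=FEED_XML.encode(),
    headers={"Content-Type": "text/xml"},
    method="POST",
)
print(urllib.request.urlopen(req).read().decode())
```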
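For the primary/DR note, replication is declarative too. In this illustrative fragment (same hedged XML dialect as the previous sketch), the staged feed lists the backup cluster as a replication target; the cleansed and conformed intermediate feeds would simply omit it, matching the "replicate staged and presented data, skip the middle" design described above:

```python
# Fragment that slots into the <clusters> section of a feed definition:
# the staged feed is replicated to the DR cluster, intermediate feeds are not.
STAGED_FEED_CLUSTERS = """
  <clusters>
    <cluster name="primary-cluster" type="source">
      <validity start="2013-06-01T00:00Z" end="2099-01-01T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <cluster name="backup-cluster" type="target">
      <validity start="2013-06-01T00:00Z" end="2099-01-01T00:00Z"/>
      <retention limit="months(12)" action="delete"/>
      <!-- Falcon copies each instance here; on failover, clients read the
           staged data from backup-cluster and reprocess into a new primary. -->
    </cluster>
  </clusters>
"""
```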
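To ground the Knox note: from the client's point of view, the cluster collapses to one perimeter endpoint. A sketch assuming a gateway at the conventional 8443 port, a topology named "default", and LDAP-backed HTTP basic auth; the gateway/{topology}/{service} URL scheme is Knox's, everything else is illustrative:

```python
import base64
import json
import urllib.request

# Single perimeter endpoint: the client never talks to the NameNode or
# DataNodes directly, only to the Knox gateway.
GATEWAY = "https://knox.example.com:8443/gateway/default"

def knox_request(path, user, password):
    """Issue a WebHDFS call through Knox with basic auth (LDAP/AD behind it)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{GATEWAY}{path}",
        headers={"Authorization": f"Basic {token}"},
    )
    # A self-signed gateway certificate would additionally need an ssl.SSLContext.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# List a directory via WebHDFS, proxied and authenticated at the perimeter.
listing = knox_request("/webhdfs/v1/data/clicks?op=LISTSTATUS", "alice", "secret")
for status in listing["FileStatuses"]["FileStatus"]:
    print(status["pathSuffix"], status["type"])
```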

Transcript

  • 1. © Hortonworks Inc. 2013 Hadoop Now, Next, and Beyond: Community Driven Enterprise Apache Hadoop. Eric Baldeschwieler, Hortonworks Co-Founder & CTO, @jeric14
  • 2. © Hortonworks Inc. 2013 A Quick History of Apache Hadoop. Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/ [Chart: Hadoop at Yahoo!, 2006–2012, by year; number of nodes (up to 45,000) and raw HDFS storage in PB (up to 400).] Milestones along the curve: HDFS open sourced with Apache; Yahoo! commits to scaling Hadoop for production use; research workloads in search and advertising; production (modeling) with machine learning & WebMap; revenue systems with security, multi-tenancy, and SLAs; increased user base with partitioned namespaces; Hortonworks spinoff for enterprise hardening; current team with Y! focus.
  • 3. © Hortonworks Inc. 2013 2.0 Architected for the Broad Enterprise. Hadoop 2.0 key highlights, mapped to enterprise requirements: rolling upgrades (zero downtime); disaster recovery (multi-data-center); snapshots (point-in-time recovery); full-stack HA (reliability); Hive on Tez (interactive query); YARN (mixed workloads). HDP 2.0 features: single cluster, many workloads (batch, interactive, online, streaming).
  • 4. © Hortonworks Inc. 2013 Hive: More SQL & 100X Faster. Stinger Phase 1 (done in Hive 0.11; we are here): base optimizations, SQL analytics, ORCFile format. Stinger Phase 2 (some work started): YARN resource management, Hive on Apache Tez, query service. Stinger Phase 3: vector query, buffer cache, query planner. SQL compliance highlights: CHAR, VARCHAR, DATE, DECIMAL; sub-queries for IN/NOT IN and HAVING; EXISTS / NOT EXISTS; INTERSECT, EXCEPT; UNION DISTINCT and UNION outside of subquery; ROLLUP and CUBE; windowing functions (OVER, RANK, etc.).
  • 5. © Hortonworks Inc. 2013 Hive's Performance Trajectory. [Charts: Star join speedup (TPC-DS Query 27): Hive 10 Text 1x, Hive 10 RC 2x, Hive 11 ORC & Tez 21x faster. Fact table join speedup (TPC-DS Query 82): Hive 10 Text 1x, Hive 10 RC 14x, Hive 11 ORC & Tez 78x faster.] Stinger phases as on the previous slide, with Phase 1 (base optimizations, SQL analytics, ORCFile format) complete.
  • 6. Disruptive Forces & Hadoop Impact • Cloud • In Memory Databases • ARM, Atom, GPUs • Solid State Storage
  • 7. © Hortonworks Inc. 2013 Making Hadoop Enterprise Ready. HORTONWORKS DATA PLATFORM (HDP), deployable on OS/VM, cloud, or appliance. Hadoop Core: HDFS, MapReduce, YARN, Tez*, other. Platform services (enterprise readiness): high availability, disaster recovery, rolling upgrades, security and snapshots. Data services: Hive & HCatalog, Pig, HBase; load & extract: Sqoop, Flume, NFS, WebHDFS; Knox*. Operational services: Oozie, Ambari, Falcon*.
  • 8. © Hortonworks Inc. 2013 Enjoy the Summit! http://hortonworks.com/sandbox/ Follow us: @hortonworks
  • 9. Today’s Agenda Wednesday, June 26 8:30 – 10:50 Keynote – Plenary Session 10:50 – 11:20 Break in Community Showcase 11:20 – 12:00 Breakout Sessions 12:50 – 2:05 Lunch 2:05 – 5:35 Breakout Sessions with Breaks 5:35 – 7:00 Exhibitor Reception 7:00 – 10:30 Hadoop Summit Party at The Tech Museum of Innovation