Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanced Data Analytics at Yahoo - Amr Awadallah, Cloudera

"Amr Awadallah served as the VP of Engineering of Yahoo's Product Intelligence Engineering (PIE) team for a number of years. The PIE team was responsible for business intelligence and advanced data analytics across a number of Yahoo's key consumer-facing properties (search, mail, news, finance, sports, etc.). Amr will share the data architecture that PIE had implemented before Hadoop was deployed and the headaches that architecture entailed. Amr will then show how most, if not all, of these headaches were eliminated once Hadoop was deployed. Amr will illustrate how Hadoop and relational databases complement each other within the traditional business intelligence data stack, and how that enables organizations to access all their data under different operational and economic constraints."

Published in: Technology

Transcript

  • 1. November 2011. How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics. Dr. Amr Awadallah | Founder, CTO, VP of Engineering. aaa@cloudera.com, twitter: @awadallah
  • 2. Business Intelligence Before Adopting Apache Hadoop (diagram)
      • Data flow, from source to consumer: Instrumentation → Collection → (mostly append) → Storage Only Grid (original raw data) → ETL Compute Grid → RDBMS (processed data) → BI Reports + Interactive Apps.
      • Problems called out on the diagram:
          – Can't explore original high-fidelity raw data.
          – Moving data to compute doesn't scale.
          – Archiving = premature data death.
      ©2011 Cloudera, Inc. All Rights Reserved.
  • 3. Business Intelligence After Adopting Apache Hadoop (diagram)
      • Data flow, from source to consumer: Instrumentation → Collection → Hadoop: Storage + Compute Grid → RDBMS → BI Reports + Interactive Apps.
      • Capabilities called out on the diagram:
          – Data Exploration & Advanced Analytics.
          – ETL and Aggregations; Complex Data Processing.
          – Keep Data Alive Forever.
  • 4. So What is Apache Hadoop?
      • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).
      • Core Hadoop has two main components:
          – Hadoop Distributed File System: self-healing high-bandwidth clustered storage.
          – MapReduce: fault-tolerant distributed processing.
      • Key business values:
          – Flexible: Store any data, run any analysis (Mine First, Govern Later).
          – Scalable: Start at 1TB/3 nodes, then grow to petabytes/1000s of nodes.
          – Affordable: Cost per TB at a fraction of traditional options.
          – Open Source: No lock-in, rich ecosystem, large developer community.
          – Broadly adopted: A large and active ecosystem, proven to run at scale.
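The MapReduce component named on slide 4 can be illustrated with a small in-memory sketch of the programming model: a map step emits key/value pairs, a shuffle step groups values by key, and a reduce step folds each group. This is not Hadoop's API. The function names (`map_reduce`, `count_property`, `total`) and the log format are assumptions made up for the illustration, and nothing here is distributed or fault-tolerant.

```python
from collections import defaultdict

# Hypothetical single-process stand-in for the MapReduce model (not Hadoop's API).
def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}

def count_property(log_line):
    # Assumed log format for this sketch: "<user> <property>".
    _user, prop = log_line.split()
    yield prop, 1

def total(_key, values):
    return sum(values)

LOG = ["u1 mail", "u2 search", "u1 search", "u3 mail", "u2 search"]
page_views = map_reduce(LOG, count_property, total)
print(page_views)
```

On a real cluster the map and reduce calls would run on many nodes near the data; the sketch only shows the shape of the computation.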
  • 5. The Main Benefit: Agility/Flexibility
      • Schema-on-Write (RDBMS):
          – Schema must be created before data is loaded.
          – An explicit load operation has to take place, which transforms the data into the DB's internal structure.
          – New columns must be added explicitly before data for such columns can be loaded into the database.
          – Benefits: Read is fast; Standards/Governance.
      • Schema-on-Read (Hadoop):
          – Data is simply copied to the file store; no transformation is needed.
          – A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns.
          – New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
          – Benefits: Load is fast; Flexibility/Agility.
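The schema-on-read behaviour described on slide 5 can be sketched in a few lines: raw lines are stored untouched, and a deserializer chosen at read time decides which columns exist. The names `RAW_STORE`, `load`, `serde_v1`, `serde_v2` and the tab-separated record layout are assumptions for this illustration, not Hive's SerDe interface.

```python
# Stand-in for files copied into the file store exactly as they arrive.
RAW_STORE = []

def load(line):
    # Schema-on-read: loading is a plain copy, with no parsing or validation.
    RAW_STORE.append(line)

def serde_v1(line):
    # First deserializer: only knows about a timestamp and a user column.
    ts, user = line.split("\t")[:2]
    return {"ts": ts, "user": user}

def serde_v2(line):
    # Updated deserializer: also extracts an optional third "country" field.
    fields = line.split("\t")
    country = fields[2] if len(fields) > 2 else None
    return {"ts": fields[0], "user": fields[1], "country": country}

def read(deserialize):
    # The schema is applied here, at read time, to every stored line.
    return [deserialize(line) for line in RAW_STORE]

load("2011-11-08T10:00\tu1")
load("2011-11-08T10:05\tu2\tEG")  # a new field starts flowing before anything parses it

rows_v1 = read(serde_v1)  # the extra field is ignored
rows_v2 = read(serde_v2)  # the extra field appears retroactively, with no reload
```

The stored lines never change between the two reads; only the deserializer does, which is the "appears retroactively" point from the slide.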
  • 6. What is Complex Data Processing?
      1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the "assembly language" of Hadoop).
      2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce.
      3. Crunch: A library for multi-stage MapReduce pipelines in Java.
      4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads.
      5. Hive: A SQL interpreter out of Facebook; also includes a metastore mapping files to their schemas and associated SerDes.
      6. Oozie: A PDL XML workflow engine that enables creating a workflow of jobs composed of any of the above.
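Streaming MapReduce (item 2 on slide 6) works with any language because the mapper and reducer only exchange lines of text, by default with key and value separated by a tab, and the reducer receives its input sorted by key. The sketch below mimics that line protocol locally. `run_locally` is a hypothetical helper in which `sorted()` stands in for the framework's sort between the two phases; on a cluster the mapper and reducer would be separate programs reading standard input.

```python
from itertools import groupby

def mapper(lines):
    # Emit one "key<TAB>value" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    # Input arrives sorted by key, so equal keys are adjacent and can be
    # summed with a single pass.
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

def run_locally(lines):
    # Hypothetical local driver: sorted() replaces the framework's shuffle/sort.
    return list(reducer(sorted(mapper(lines))))

result = run_locally(["Hadoop stores data", "Hadoop processes data"])
for line in result:
    print(line)
```

Because each phase is just a filter over text lines, the same two functions could be rewritten in any language that can read and write text, which is the trade-off the slide describes.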
  • 7. What This Means For You: Agility. Up Front Design vs. Just in Time.
  • 8. What This Means For You: Innovation. Data Committee vs. Data Scientist.
  • 9. What This Means For You: Consolidation. Silos vs. Sharing.
  • 10. What This Means For You: Extract Value from Latent Data. Archive to Tape vs. Keep Data Alive.
  • 11. What This Means For You: Ability to Grow Fluidly.
  • 12. What This Means For You: Data Beats Algorithm. Smarter Algos vs. More Data.
  • 13. Where Does Hadoop Fit in the Enterprise Data Stack? (diagram)
      • People: Data Scientists, Analysts, Business Users, System Operators, Data Architects, Customers.
      • Tools and systems: IDEs (Development Tools); BI, Analytics and Enterprise Reporting (Business Intelligence Tools); Cloudera Mgmt Suite; ETL Tools; Enterprise Data Warehouse; Web Application; Low-Latency Serving Systems; Relational Databases.
      • Data sources: Logs, Files, Web Data.
  • 14. Use The Right Tool For The Right Job
      • Relational Databases. Use when:
          – Interactive OLAP Analytics (<1sec)
          – Multistep ACID Transactions
          – 100% SQL Compliance
      • Hadoop. Use when:
          – Structured or Not (Flexibility)
          – Scalability of Storage/Compute
          – Complex Data Processing
  • 15. Two Core Use Cases Common Across Many Industries (Advanced Analytics and Data Processing)
      • Web: Social Network Analysis (advanced analytics); Clickstream Sessionization (data processing).
      • Media: Content Optimization (advanced analytics); Clickstream Sessionization (data processing).
      • Telco: Network Analytics (advanced analytics); Mediation (data processing).
      • Retail: Loyalty & Promotions (advanced analytics); Data Factory (data processing).
      • Financial: Fraud Analysis (advanced analytics); Trade Reconciliation (data processing).
      • Federal: Entity Analysis (advanced analytics); SIGINT (data processing).
      • Bioinformatics: Sequencing Analysis (advanced analytics); Genome Mapping (data processing).
      • Manufacturing: Product Quality (advanced analytics); Mfg Process Tracking (data processing).
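Clickstream sessionization, listed on slide 15 as a data-processing use case, is commonly done by grouping click events per user, ordering them by time, and starting a new session whenever the gap between consecutive clicks exceeds a timeout. The slides do not describe an algorithm, so the following is only a sketch of that common approach. The 30-minute timeout, the `(user, timestamp)` event shape, and the function name `sessionize` are all assumptions.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout; not from the slides

def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Split (user, unix_timestamp) click events into per-user sessions."""
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                # Too long since the previous click: open a new session.
                user_sessions.append([ts])
            else:
                user_sessions[-1].append(ts)
        sessions[user] = user_sessions
    return sessions

clicks = [("u1", 0), ("u2", 100), ("u1", 600), ("u1", 4000), ("u2", 200)]
sessions = sessionize(clicks)
print(sessions)
```

In a MapReduce setting the per-user grouping would fall out of the shuffle (user as the key), and the gap logic would live in the reducer.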
  • 16. CDH: Cloudera's Distribution Including Apache Hadoop. The #1 commercial and non-commercial Apache Hadoop distribution.
      • Components (diagram):
          – File System Mount: FUSE-DFS
          – UI Framework/SDK: Hue
          – Data Mining: Apache Mahout
          – Workflow: Apache Oozie
          – Scheduling: Apache Oozie
          – Metadata: Apache Hive
          – Languages / Compilers: Apache Pig, Apache Hive
          – Data Integration: Apache Flume, Apache Sqoop
          – Fast Read/Write Access: Apache HBase
          – Coordination: Apache ZooKeeper
      • Open Source: 100% Apache licensed, 100% Open Source, 100% Free, No Forks.
      • Enterprise Ready: Predictable releases, Documentation, Hotfix Patches, Intensive QA.
      • Proven at Scale: Deployed at hundreds of enterprises across many industries.
      • Integrated: All required component versions & dependencies are managed for you.
      • Industry Standard: Existing RDBMS, ETL and BI systems work best with it.
      • Many Form Factors: Public Cloud, Private Cloud, RHEL, Ubuntu, 32/64bit, etc.
  • 17. CDH Integrates with Existing IT Infrastructure (diagram). Integration categories shown around Cloudera's Distribution including Apache Hadoop: BI/Analytics, ETL, Databases, Cloud/OS, Hardware.
  • 18. What is Cloudera Enterprise?
      • Cloudera Enterprise makes open source Apache Hadoop enterprise-easy:
          – Simplify and Accelerate Hadoop Deployment
          – Reduce Adoption Costs and Risks
          – Lower the Cost of Administration
          – Increase the Transparency & Control of Hadoop
          – Leverage the Experience of Our Experts
      • Cloudera Enterprise components:
          – Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration.
          – Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs.
      • 3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera.
      • Effectiveness: Ensuring Repeatable Value from Apache Hadoop Deployments.
      • Efficiency: Enabling Apache Hadoop to be Affordably Run in Production.
  • 19. SCM Express: Simplifies Installation and Configuration
      • Service & Configuration Manager (SCM) Express takes the complexity out of deploying and configuring CDH:
          – Provision a complete Hadoop stack in minutes.
          – Centrally manage system services through a user-friendly interface.
          – Manages services for up to 50 nodes.
          – FREE to download.
      • Key features:
          1. Automated, wizard-based installation of the complete Hadoop stack.
          2. Central, real-time dashboard for configuration management.
          3. Ability to configure the cluster while it's running.
          4. Incorporates comprehensive validation and error checking.
          5. Automates the expansion of services to new nodes when they come online.
  • 20. What I Would Like You To Remember:
      • The Key Benefits of the Apache Hadoop Data Platform:
          – Agility/Flexibility (Enables Exploration/Innovation).
          – Complex Data Processing (Any Language, Any Problem).
          – Scalability of Storage/Compute (Freedom to Grow).
          – Economical Active Archive (Keep All Your Data Alive).
      • Cloudera Enterprise enables:
          – Lower the Cost of Management and Administration.
          – Simplify and Accelerate Hadoop Deployment.
          – Increase the Transparency & Control of Hadoop.
          – Firm SLAs on Issue Resolution.