The Power of Hadoop in Cloud Computing
Speaker Notes
  • One of Hadoop's key benefits is that you can dump any type of data into it and the input record readers will abstract it as if it were structured (i.e. schema on read vs. schema on write); a small sketch of this follows these notes. Open source software allows for innovation by partners and customers, and it enables third-party inspection of the source code, which provides assurances on security and product quality. One HDD delivers roughly 75 MB/sec, so 1,000 HDDs deliver roughly 75 GB/sec in aggregate, eliminating the "head of fileserver" bottleneck. The system is self-healing in the sense that it automatically routes around failure: if a node fails, its workload and data are transparently shifted somewhere else. It is intelligent in the sense that the MapReduce scheduler optimizes for processing to happen on the node storing the associated data (or one co-located on the same leaf Ethernet switch), and it speculatively executes redundant copies of tasks when certain nodes are detected to be slow.
  • Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate data nodes. A typical Hadoop node has eight cores, 24GB of RAM, and four 1TB SATA disks. The default data block size is 64MB, though most folks now set it to 128MB or even higher; the block-count sketch after these notes shows the arithmetic.
  • Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is to an RDBMS, which executes queries, versus SQL, which is the language the queries are written in. MapReduce can run on top of HDFS or a selection of other storage systems, with intelligent scheduling algorithms for locality, sharing, and resource optimization. The word-count sketch after these notes illustrates the programming model.
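The schema-on-read point in the first note can be made concrete with a minimal sketch; the log format, field names, and read_clicks() helper below are hypothetical illustrations, not part of the deck:

    # Schema-on-read: raw lines are dumped to storage untransformed, and
    # structure is imposed only when a reader parses them.
    RAW_LOG = [
        "acme,alice,1300000000,click",
        "acme,alice,1300003600,purchase",
    ]

    def read_clicks(lines):
        # The record reader plays the role of Hadoop's input format here.
        for line in lines:
            customer, user, ts, action = line.split(",")
            yield {"customer": customer, "user": user,
                   "ts": int(ts), "action": action}

    for record in read_clicks(RAW_LOG):
        print(record["user"], record["action"])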
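The second note's defaults (64MB blocks, replication factor 3) imply simple storage math; a sketch, where the 1TB file size is an arbitrary example rather than a number from the deck:

    # Block count and raw disk footprint under the defaults cited in the notes.
    import math

    BLOCK_SIZE  = 64 * 1024**2    # 64MB default (128MB is now common)
    REPLICATION = 3
    file_size   = 1 * 1024**4     # a hypothetical 1TB file

    blocks    = math.ceil(file_size / BLOCK_SIZE)
    raw_bytes = file_size * REPLICATION

    print(blocks)                  # 16384 blocks to place across data nodes
    print(raw_bytes // 1024**4)    # 3TB of raw disk consumed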
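The third note distinguishes MapReduce the programming model from MapReduce the platform. The canonical word count below illustrates the model in the same pseudocode style as slide 22, with a tiny in-memory driver standing in for the platform (illustrative, not code from the deck):

    from collections import defaultdict

    def map_fn(offset, line, emit):
        for word in line.split():
            emit(word, 1)             # one (word, 1) pair per occurrence

    def reduce_fn(word, counts, emit):
        emit(word, sum(counts))       # the platform groups counts by word

    # Driver: the "platform" shuffles pairs by key, then reduces each group.
    lines  = ["the quick brown fox", "the lazy dog"]
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        map_fn(offset, line, lambda k, v: groups[k].append(v))
    for word, counts in groups.items():
        reduce_fn(word, counts, print)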
Transcript

    • 1. The Power of Hadoop in Cloud Computing. Joey Echeverria, Solutions Architect. joey@cloudera.com, @fwiffo
    • 2. Yahoo! Business Intelligence Before Adopting Hadoop. [architecture diagram: Instrumentation feeds mostly-append Collection into a Storage-Only Grid (20TB/day); an ETL Compute Grid loads an RDBMS (200GB/day), which serves BI Reports + Interactive Apps; annotations: "Couldn't Explore Original Raw Data" and "Moving Data To Compute Doesn't Scale"]
    • 3. BI Problems Before Hadoop
        • Shrinking ETL window (25 hours to process a day's worth of data)
        • No scalable ETL reprocessing to recover from data errors (active archive)
        • Conformation loss (a new browser agent)
        • No queries on raw data (new product)
        • No consolidated repository (cross-product queries)
        • Only SQL (photo/image transcoding, satellite map processing)
    • 4. Yahoo! Business Intelligence After Adopting Hadoop. [architecture diagram: Instrumentation feeds mostly-append Collection into Hadoop (Storage + Compute Grid) for ETL, aggregations, and complex data processing; the RDBMS serves BI Reports + Interactive Apps, while Data Exploration & Advanced Analytics run directly on Hadoop]
    • 5. So What is Apache Hadoop?
        • A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license)
        • Core Hadoop has two main components:
            • Hadoop Distributed File System: self-healing, high-bandwidth clustered storage
            • MapReduce: fault-tolerant distributed processing
        • Key business values:
            • Flexible: store any data, run any analysis (mine first, govern later)
            • Affordable: cost per TB at a fraction of traditional options
            • Scalable: start at 1TB/3 nodes, then grow to petabytes/thousands of nodes
            • Open source: no lock-in, a rich ecosystem, a large developer community
            • Broadly adopted: a large and active ecosystem, proven to run at scale
    • 6. Hadoop Design Axioms
        1. System Shall Manage and Heal Itself
        2. Performance Shall Scale Linearly
        3. Compute Moves to Data
        4. Simple Core, Modular and Extensible
    • 7. HDFS: Hadoop Distributed File System. [diagram: Block Size = 64MB, Replication Factor = 3; aggregate throughput scales with the number of nodes ("infinite throughput"); cost/GB is a few ¢/month vs. $/month]
    • 8. MapReduce: Distributed Processing. [diagram]
    • 9. Agility
        • Schema-on-Write (RDBMS):
            • Schema must be created before data is loaded
            • An explicit load operation has to take place which transforms data to the database's internal structure
            • New columns must be added explicitly before data for such columns can be loaded into the database
            • Read is fast
            • Benefit: standards/governance
        • Schema-on-Read (Hadoop):
            • Data is simply copied to the file store; no special transformation is needed
            • A Serializer/Deserializer (SerDe) is applied at read time to extract the required columns
            • New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
            • Load is fast
            • Benefit: evolving schemas/agility
    • 10. Scalability. Start with a few servers and 10s of TBs, then grow to 1000s of servers and 10s of PBs. [diagram: AUTO SCALE]
    • 11. Active Archive: Keep Data Accessible
        • Return on Byte (ROB) = the value to be extracted from a byte divided by the cost of storing that byte
        • If ROB < 1, the data gets buried in the tape wasteland, so we need more economical active storage. (A worked ROB example follows the transcript.) [diagram contrasting high-ROB and low-ROB data]
    • 12. Use The Right Tool For The Right Job
        • Relational databases, use when: interactive OLAP analytics (<1 sec); multistep ACID transactions; SQL compliance
        • Hadoop, use when: data is structured or not (agility); scalable storage/compute; complex data processing
    • 13. Cloudera's Open Source Data Platform (CDH): Cloudera's Distribution Including Apache Hadoop. [component diagram: Hue, Hue SDK, Oozie, Hive, Pig, Flume, Sqoop, HBase, Avro, ZooKeeper]
    • 14. Where Does Hadoop Fit in the Enterprise Data Stack? [architecture diagram: sources (logs, files, web data, relational databases) feed Cloudera's Distribution Including Apache Hadoop (CDH); CDH exchanges data with the Enterprise Data Warehouse and Low-Latency Serving Systems behind the customer-facing web application; operators, engineers, analysts, and business users come in through IDEs, BI/analytics tools, and reporting; Cloudera Enterprise adds the Cloudera Management Suite and Cloudera Support]
    • 15. What Can Hadoop Do For You? Two core use cases applied across verticals:
        Vertical         1. Advanced Analytics           2. Data Processing
        Web              Social Network Analysis         Clickstream Sessionization
        Media            Content Optimization            Engagement
        Telco            Network Analytics               Mediation
        Retail           Loyalty & Promotions Analysis   Data Factory
        Financial        Fraud Analysis                  Trade Reconciliation
        Federal          Entity Analysis                 SIGINT
        Bioinformatics   Sequencing Analysis             Genome Mapping
    • 16. Clickstream Sessionization
        • Question: which ads contributed to a sale?
        • Data: click logs (customer, user, time, action)
        • Scale: millions of events per day, going back up to a year
        • Timeliness: daily reports
    • 17–21. Clickstream Sessionization (2-hour back window). [timeline diagrams built up across five slides: user 1's events A–E and user 2's events F–I plotted between 12pm and 6pm, with each event duplicated into its own 2-hour window and the window two hours back (C, D, E appear twice for user 1; F, G, H, I appear twice for user 2)]
    • 22. Clickstream Sessionization (MapReduce). (The windowing design is discussed after the transcript.)

        def map(ts, event, context):
            # Write the event under the 2-hour window containing it and under
            # the window two hours earlier, so a purchase and any click in the
            # preceding two hours share at least one reducer key.
            ts1 = truncate(ts)              # round down to the window boundary
            ts2 = ts1 - hours(2)
            context.write((event.customer, event.user, ts1), event)
            context.write((event.customer, event.user, ts2), event)

        def reduce(key, events, context):
            (customer, user, ts) = key
            purchases = []
            clicks = []
            for event in events:
                if event.isPurchase():
                    purchases.append(event)
                else:
                    clicks.append(event)
            for purchase in purchases:
                for click in clicks:
                    # Pair a purchase with each click that precedes it by less
                    # than the 2-hour back window.
                    if 0 <= purchase.ts - click.ts < hours(2):
                        context.write(customer, (purchase, click))
    • 23. Questions? Joey Echeverria, joey@cloudera.com, @fwiffo. Copyright © 2011, Cloudera, Inc. All Rights Reserved.
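To ground the Return on Byte formula from slide 11, a small worked example; the dollar figures are hypothetical, chosen only to show the ROB < 1 threshold, with the "few ¢/month vs. $/month" contrast borrowed from slide 7:

    # Return on Byte (ROB) = value extracted from a byte / cost of storing it.
    # The figures below are hypothetical illustrations, not from the deck.
    value_per_gb = 0.10                     # value mined per GB per month

    rob_traditional = value_per_gb / 1.00   # ~$1/GB-month storage  -> 0.1
    rob_hadoop      = value_per_gb / 0.03   # a few ¢/GB-month      -> ~3.3

    print(rob_traditional)   # < 1: the data gets buried in the tape wasteland
    print(rob_hadoop)        # > 1: cheap active storage keeps it queryable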
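A note on the slide-22 design: for the two-key trick to work, truncate() must round a timestamp down to the boundary of its 2-hour window. Each event is then written under its own window and the window two hours back, which guarantees that a purchase and any click less than two hours older share at least one reducer key, at the cost of roughly doubling the shuffled data. When a click and a purchase share both keys the pair is emitted twice, so a downstream de-duplication step may be needed. A minimal trace (the event times are hypothetical, loosely matching the timeline slides):

    # Which reducer keys does each event get? truncate() rounds down to a
    # 2-hour window boundary; times are hours since midnight (hypothetical).
    def truncate(ts, window=2):
        return ts - (ts % window)

    events = {"D (click)": 14.5, "E (purchase)": 15.25}
    for name, ts in events.items():
        ts1 = truncate(ts)
        print(name, "-> window keys", ts1, "and", ts1 - 2)
    # Both events land under keys 14.0 (2pm) and 12.0 (12pm), so the reducer
    # sees this click/purchase pair in two groups and emits it twice.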