Hadoop for Shanghai dev meetup
Published in Technology, Business
Transcript

  • 1. Hadoop (Shanghai Developer Meetup – Sept 15, 2011). 余家昌 (Andrew Yu), EMC Greenplum. © Copyright 2011 EMC Corporation. All rights reserved.
  • 2. The Elephant Chase
  • 3. (No text on this slide.)
  • 4. Yahoo! Hadoop use cases
      • Personalized Yahoo! Homepage
      • Yahoo! Mail anti-spam
      • Search and Ad pipelines
      • Ad inventory prediction
      • Data analytics
      • etc.
  • 5. Enterprise Use Case: “Big ETL”
      • Challenge: Transform Massive Data Flows Containing Data Needed for Complex Analysis
      • Solution: Hadoop/MapReduce as ETL fabric to load to Analytic Database
      • Examples:
          – Web Traffic Reduction
          – Network Traffic & Performance Analysis
          – Location Analytics for People and Goods
          – Smart Electric Power Grid
          – Genome Analysis
          – Clinical Outcome Research & Analysis
      • Data Sources:
          – Web server & app server logs
          – CDR / xDRs
          – Router & Switching Subsystem Logs
          – Sensor networks
      • Components:
          – Hadoop: Massively-parallel ingest, storage and analysis
          – MapReduce: Runs multiple cascaded custom analysis / extraction on capture data
          – Connectors move structured data to Analytics DB
      • Hadoop’s Roles:
          – Capture TBs/day of machine-generated data
          – Quality: Run data quality tasks in MapReduce
          – Execute MapReduce flows
          – Extract/Combine data/metadata
          – Move processed data to analytic DB
      • Limitations & Cautions:
          – Software development, More parts (Cascading/Flow), Maintainability
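To make the "Big ETL" flow on slide 5 more concrete, here is a minimal sketch in plain Python of a map step with a data-quality check followed by a reduce step that produces rows ready for loading. It only simulates the MapReduce pattern in one process and does not use Hadoop. The log layout (`host method path status`) and the function names `map_log_line`, `reduce_counts` and `run_etl` are assumptions made for illustration, not anything from the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, str]  # (path, status)


def map_log_line(line: str) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((path, status), 1) for each well-formed log line.

    Assumes a hypothetical four-field layout: host method path status.
    """
    parts = line.split()
    # Data-quality task: silently drop records that do not match the layout.
    if len(parts) != 4 or not parts[3].isdigit():
        return
    _host, _method, path, status = parts
    yield (path, status), 1


def reduce_counts(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, int]:
    """Reduce step: sum the values for each key."""
    totals: Dict[Key, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_etl(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Run map then reduce, returning sorted rows (path, status, hits)."""
    mapped = (pair for line in lines for pair in map_log_line(line))
    totals = reduce_counts(mapped)
    return sorted((path, status, hits) for (path, status), hits in totals.items())


if __name__ == "__main__":
    sample = [
        "10.0.0.1 GET /index.html 200",
        "10.0.0.2 GET /index.html 200",
        "garbage",
        "10.0.0.3 GET /missing 404",
    ]
    for row in run_etl(sample):
        print(row)
```

In a real deployment the reduced rows would be handed to a connector that loads them into the analytic database; that step is left out here.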
  • 6. Enterprise Use Case: Fraud Detection
      • Challenge: Identify & alert on fraudulent activity patterns
      • Solution: Hadoop/MapReduce to filter & correlate communications
      • Examples:
          – ESPs: Email Fraud
          – Finance/Banking: Bank Fraud
          – Advertising: Click Fraud
          – Telecom: Network Fraud
      • Data Sources:
          – Web & app server logs
          – IP/Call Records
          – Email Traffic
          – Customer Transaction Data
          – Banking/Credit Data
      • Components:
          – Hadoop: Massively-parallel ingest, storage and analysis
          – Mahout: Machine learning tool for building fraud algorithms
          – MapReduce: Rapid analysis & algorithm deployment
      • Hadoop’s Role(s):
          – Massive ingest of historical/real-time data
          – Build/Validate model for fraud detection manually or using Mahout
          – Parallel MapReduce jobs for near real-time fraud detection
      • Limitations & Cautions:
          – Software development, Partial Solution (not Real-time, not Interactive)
  • 7. Enterprise Use Case: Cluster Analysis
      • Challenge: Grouping a collection of data according to common similarities
      • Solution: Process and Refine in Hadoop and load into Analytical DB
      • Examples:
          – Customer segmentation
          – Financial cost/risk analysis
          – Patient-centric healthcare
          – Financial stock classification
          – Social network analysis
      • Data Sources:
          – Health records
          – Sales data
          – Human genome sequences
          – Financial trading data
          – Facebook/Twitter/LinkedIn
      • Components:
          – Hadoop: Flexible data storage as volume increases and structures vary
          – MapReduce: Cascading allows data processing with minimal adjustments
          – Optional: Connectors to move results to Analytic DB
      • Hadoop’s Role(s):
          – Flexible: Allow agile implementation and unit testing of algorithms
          – Large scale analysis in Hadoop creates more accurate groupings
          – Rapid, parallel processing in MapReduce
      • Limitations & Cautions:
          – Software development, Complex Integration with Sources
  • 8. Greenplum HD: Community Edition Stack (100% Apache)
      • Currently supported: Hive, Pig, HBase, Zookeeper, MapReduce Framework (MapRed), Hadoop Distributed File System (HDFS)
      • Future releases may include support for Oozie and Mahout
  • 9. Greenplum HD: Enterprise Edition Stack (100% Apache)
      • Currently supported: Enhanced Monitoring Interface, Hive, Pig, HBase, Zookeeper, MapReduce Framework (MapRed), Hadoop Distributed File System (HDFS)
      • Future releases may include support for Oozie and Mahout
  • 10. Greenplum HD: Enterprise Edition. Enterprise-Ready Hadoop Platform for Unstructured Data
      • Faster:
          – 2 – 5x Faster than Apache Hadoop
      • Reliable:
          – High Availability
          – Mirroring
      • Easier to Use:
          – NFS mountable
          – System Management
  • 11. Greenplum Enterprise HD is Faster than Other Distributions
      • [Chart] DFSIO: throughput in MB/sec for Read and Write (higher is better)
      • [Chart] Terasort: elapsed time in minutes for 3.5 TB (lower is better)
      • Test cluster: 10 nodes, 2x Quad-Core, 24G DRAM, 12 x 1TB SATA Drives @ 7200 rpm, Quad NICs
  • 12. Greenplum Enterprise HD: Distributed Name Node
      • Fully distributed service running on all Hadoop nodes
      • Automatic and transparent failover
      • Persistent metadata
      • Highly scalable in number of files
      • [Diagram] Every Hadoop node carries a Name Node (NN) service
  • 13. Greenplum Enterprise HD: Job Tracker High Availability
      • Assures business continuity
      • Designed for mission critical use
          – Automatic stateful restart
          – Task Tracker reconnects without task loss
          – Persistent completed task state
      • [Diagram] Greenplum Enterprise HD Distribution for Apache Hadoop. Labels: Enterprise HD MapReduce, Job Tracker HA, Distributed Name Node, Enterprise HD Lockless Storage Services
  • 14. Greenplum Enterprise HD: Snapshots
      • Intelligent Snapshots
          – Automatic data deduplication
          – Block sharing for space savings
      • Fast and flexible
          – Zero performance loss when writing to the original
      • Easy to manage
          – Scheduled or on-demand
          – Drag and drop recovery
      • [Diagram] Hadoop / HBase applications and NFS applications read/write through Enterprise HD Lockless Storage Services. Redirect on write for snapshot: blocks A, B, C, C’, D shared by Snapshot 1, Snapshot 2 and Snapshot 3
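The "redirect on write" idea named on slide 14 can be sketched in a few lines of Python. This is a generic illustration of the technique and not Greenplum's implementation: a snapshot freezes the name-to-block mapping, and a later write allocates a new block and repoints only the live mapping, so unchanged blocks stay shared. The `Volume` class and its method names are hypothetical.

```python
from typing import Dict, Optional


class Volume:
    """Toy block store with redirect-on-write snapshots (illustrative only)."""

    def __init__(self) -> None:
        self._blocks: Dict[int, bytes] = {}             # block id -> data, never overwritten
        self._live: Dict[str, int] = {}                 # logical name -> current block id
        self._snapshots: Dict[str, Dict[str, int]] = {} # snapshot name -> frozen mapping
        self._next_id = 0

    def write(self, name: str, data: bytes) -> None:
        """Store data in a fresh block and redirect the live mapping to it."""
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = data
        self._live[name] = block_id

    def snapshot(self, snap_name: str) -> None:
        """Freeze the current mapping; no block data is copied."""
        self._snapshots[snap_name] = dict(self._live)

    def read(self, name: str, snap_name: Optional[str] = None) -> bytes:
        """Read from the live volume, or from a snapshot if one is named."""
        mapping = self._live if snap_name is None else self._snapshots[snap_name]
        return self._blocks[mapping[name]]

    def shared_blocks(self, snap_name: str) -> int:
        """Count blocks the snapshot still shares with the live volume."""
        snap = self._snapshots[snap_name]
        return sum(1 for name, block in snap.items() if self._live.get(name) == block)


if __name__ == "__main__":
    vol = Volume()
    for name in "ABCD":
        vol.write(name, name.encode())
    vol.snapshot("snap1")
    vol.write("C", b"C'")  # redirected to a new block; snap1 keeps the old C
    print(vol.read("C"), vol.read("C", "snap1"), vol.shared_blocks("snap1"))
```

Because the write only allocates a block and updates one mapping entry, taking a snapshot adds no copy work to the write path, which is the property the slide attributes to the design.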
  • 15. Greenplum Enterprise HD: Mirroring
      • Business Continuity
          – Efficient design
          – Differential deltas are updated
          – Data is compressed and check-summed
      • Easy to manage
          – Scheduled or on-demand
          – Consistent point-in-time
      • [Diagram] Labels: Production, Research, Datacenter 1, WAN, Datacenter 2; Production, WAN, Cloud
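Slide 15 describes mirroring in terms of differential deltas that are compressed and check-summed. The sketch below shows one simple way those three ideas fit together, using only the Python standard library (`hashlib` and `zlib`). It is an assumption-laden illustration, not the product's protocol: volumes are modelled as dictionaries of file name to bytes, deletions are ignored, and `build_delta` and `apply_delta` are hypothetical names.

```python
import hashlib
import zlib
from typing import Dict, List, Tuple

Delta = List[Tuple[str, bytes, str]]  # (name, compressed payload, sha256 hex digest)


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the data."""
    return hashlib.sha256(data).hexdigest()


def build_delta(source: Dict[str, bytes], mirror: Dict[str, bytes]) -> Delta:
    """Collect only the files that are missing or different on the mirror."""
    delta: Delta = []
    for name, data in source.items():
        if name not in mirror or checksum(mirror[name]) != checksum(data):
            delta.append((name, zlib.compress(data), checksum(data)))
    return delta


def apply_delta(mirror: Dict[str, bytes], delta: Delta) -> None:
    """Decompress each payload, verify its checksum, then update the mirror."""
    for name, payload, expected in delta:
        data = zlib.decompress(payload)
        if checksum(data) != expected:
            raise ValueError(f"checksum mismatch for {name}")
        mirror[name] = data


if __name__ == "__main__":
    production = {"a.log": b"alpha", "b.log": b"beta v2"}
    remote = {"a.log": b"alpha", "b.log": b"beta v1"}
    delta = build_delta(production, remote)
    apply_delta(remote, delta)
    print([name for name, _, _ in delta], remote == production)
```

Only the changed file crosses the (simulated) WAN link, and the receiver refuses any payload whose checksum does not match.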
  • 16. Greenplum Enterprise HD: Direct Access Using NFS
      • Simple application integration
          – Leverage NFS for random read/write access
      • Direct access for standard Hadoop tools
          – Command line utilities
          – File browsers
          – Desktop utilities
      • [Diagram] Greenplum Enterprise HD Distribution for Apache Hadoop. Labels: Enterprise HD MapReduce, Job Tracker HA, Distributed Name Node, Enterprise HD Lockless Storage Services
  • 17. Greenplum Enterprise HD: Simple Management
      • Intuitive
      • Insightful
      • Complete
      • One node or thousands
  • 18. Greenplum HD: Software Distributions
      • Features (Community Edition / Enterprise Edition):
          – Apache Compatibility: 100% Apache Open Source / 100% API Compatible
          – Name Node High Availability: Reference Implementation / Distributed and High Availability
          – Job Tracker HA: Reference Implementation / HT High Availability
          – Name Node Scalability: NN Metadata in Memory / Distributed Name Node
          – Premium Support: Yes / Yes
          – Performance: (blank) / 2 - 5x faster than Community Edition
          – Snapshots: No / Yes
          – Mirrors: No / Yes
          – NFS Mounts: No / Yes
          – System Management: No / Yes
          – Available for Ordering: May 9th 2011 / Q3
          – Pricing: Per Node Pricing / Per Node Pricing
  • 19. Greenplum HD on Data Computing Appliance
      • Introducing the world’s first:
          – High-performance
          – Purpose-built
          – Data co-processing Hadoop appliance
      • Combining Greenplum Database and Greenplum Hadoop in one appliance
  • 20. GPDB / GPHD Interoperability
      • [Diagram] Labels: GPHD data in/out; GPHD in GPDB Query; File on HD; GPDB External Tables
  • 21. Greenplum Database: External Tables for Hadoop
      • Bring GPDB relational expressive power to HDFS
          – HDFS data presented as external tables
          – HDFS data supporting full SQL syntax
      • Have ALL, PART or NONE of your data in HDFS
      • Leverage full parallelism of both Hadoop and GPDB
          – GPDB can read from/write to HDFS
      • Example:
          Select count(*) from HDFS_data h, GPDB_data g where h.key = g.key;
          Insert into HDFS_data select * from GPDB_data;
  • 22. Greenplum Enterprise HD: HDFS Integration – Parallelized Flow
      • Reading:
          – Each GPDB segment reads a portion of the file
              • Segment i of n reads the i/n-th portion
          – Access offset from HDFS namenode
          – Read data directly from HDFS datanode
      • Writing:
          – Each GPDB segment writes a file
          – HDFS balancing distributes the load evenly across datanodes
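The rule on slide 22, "Segment i of n reads the i/n-th portion", comes down to simple integer arithmetic on byte offsets. The Python sketch below shows one way to compute non-overlapping ranges that cover the whole file. The function name `segment_byte_range` and the 0-based segment index are assumptions; a real reader would also have to adjust the offsets to record boundaries, which this sketch ignores.

```python
from typing import Tuple


def segment_byte_range(file_size: int, segment_index: int, segment_count: int) -> Tuple[int, int]:
    """Return the half-open byte range [start, end) read by one segment.

    segment_index is 0-based (0 <= segment_index < segment_count).
    Integer division keeps the ranges contiguous and non-overlapping.
    """
    if segment_count <= 0:
        raise ValueError("segment_count must be positive")
    if not 0 <= segment_index < segment_count:
        raise ValueError("segment_index out of range")
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    start = file_size * segment_index // segment_count
    end = file_size * (segment_index + 1) // segment_count
    return start, end


if __name__ == "__main__":
    # A 10-byte file split across 3 segments.
    for i in range(3):
        print(i, segment_byte_range(10, i, 3))
```

Computing both ends from the same formula means each segment's end is exactly the next segment's start, so no byte is read twice or skipped even when the file size is not a multiple of n.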
  • 23. Big Data Analytics “Stack”
      • Analytic Toolsets (Business Analytics, BI, Statistics, etc.)
      • Greenplum Chorus: Enterprise Collaboration Platform for Data
      • Greenplum Database: World’s Most Scalable MPP Database Platform
      • Greenplum HD: Enterprise Analytics Platform for Unstructured Data
      • Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics
  • 24. THANK YOU