20090309berkeley
Transcript

  • 1. Monday, March 9, 2009
  • 2. Global Information Platforms: Evolving the Data Warehouse. Jeff Hammerbacher, VP of Products, Cloudera. March 9, 2009
  • 3. What We’ll Cover Today (Oh Crap, He’s Gonna Ramble)
    ▪ WARNING: highly speculative talk ahead
    ▪ What did we build at Facebook, and what did it accomplish?
    ▪ How will that infrastructure evolve further?
      ▪ There better be some cloud in there.
    ▪ In the language of this course, we’ll talk mostly about infrastructure; some thoughts on services and applications at the end.
    ▪ Please be skeptical and ask questions throughout
  • 4. Facebook: Stage Two (You Don’t Even Wanna See Stage One)
    ▪ Architecture diagram: Scribe tier, MySQL tier, data collection server, Oracle database server
  • 5. Facebook: Stage Four (Shades of an Information Platform)
    ▪ Architecture diagram: Scribe tier, MySQL tier, Hadoop tier, Oracle RAC servers
  • 6. Data Points: Single Organization
    ▪ 3 data centers: two on the west coast, one on the east coast
    ▪ Around 10K web servers, 1.5K database servers, 0.5K memcache servers
    ▪ Around 0.7K Hadoop nodes, and growing quickly
    ▪ Relative data volumes:
      ▪ Around 40 TB in the Cassandra tier
      ▪ Around 60 TB in the MySQL tier
      ▪ Around 1 PB in the photos tier
      ▪ Around 2 PB in the Hadoop tier
    ▪ 10 TB per day ingested into Hadoop, 15 TB generated
    ▪ IMPORTANT: the Hadoop tier is not retiring data!
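Those last two bullets imply a steep growth curve for a tier that never retires data. A back-of-the-envelope sketch (the 3x replication multiplier is an assumed HDFS default, not a figure from the slide):

```python
# Back-of-the-envelope growth of a Hadoop tier that never retires data.
# Figures from the slide: ~2 PB today, 10 TB/day ingested, 15 TB/day derived.
# The 3x replication factor is an assumption (HDFS default), not from the slide.

TB_PER_PB = 1000

def projected_logical_pb(days, start_pb=2.0, ingest_tb=10, derived_tb=15):
    """Logical (pre-replication) data size after `days` of accumulation."""
    return start_pb + days * (ingest_tb + derived_tb) / TB_PER_PB

one_year = projected_logical_pb(365)   # ~11.1 PB logical
raw_at_3x = one_year * 3               # ~33.4 PB of raw disk at 3x replication

print(f"logical after 1 year: {one_year:.1f} PB")
print(f"raw at 3x replication: {raw_at_3x:.1f} PB")
```

At 25 TB/day, the tier more than quintuples in a year, which is why the management challenges on slide 9 (capacity planning, tiered storage) loom so large.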
  • 7. Data Points: All Organizations
    ▪ 8 million servers shipped per year (IDC)
      ▪ 20% go to web companies (Rick Rashid)
      ▪ 33% go to HPC (Andy Bechtolsheim)
    ▪ 2.5 exabytes of external storage shipped per year (IDC)
    ▪ Data center costs (James Hamilton):
      ▪ 45% servers
      ▪ 25% power and cooling hardware
      ▪ 15% power draw
      ▪ 15% network
    ▪ Jim Gray:
      ▪ “Disks will replace tapes, and disks will have infinite capacity. Period.”
      ▪ “Processors are going to migrate to where the transducers are.”
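Hamilton's cost shares sum to 100%, so they apply directly to any budget. A quick sketch (the $10M annual budget is invented for illustration; only the percentages come from the slide):

```python
# James Hamilton's data center cost breakdown from the slide, applied to a
# hypothetical $10M annual budget (the budget figure is invented).

breakdown = {
    "servers": 0.45,
    "power and cooling hardware": 0.25,
    "power draw": 0.15,
    "network": 0.15,
}

# The shares partition the whole budget.
assert abs(sum(breakdown.values()) - 1.0) < 1e-9

budget_musd = 10.0
for item, share in breakdown.items():
    print(f"{item:>27}: ${share * budget_musd:.2f}M")
```

Note that servers alone dominate, which is the economic force behind the "wimpy node" hardware work on slide 13.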
  • 8. Information Platform Workloads
    ▪ Data collection: event logs, persistent storage, web
    ▪ Regular processing pipelines of varying granularity
      ▪ Summaries consumed by external users (e.g. Google Analytics)
      ▪ Summaries for internal reporting
      ▪ Ad optimization pipeline
      ▪ Experimentation platform pipeline
    ▪ Ad hoc analyses
    ▪ Data transformations and data integrity enforcement
    ▪ Document indexing
    ▪ Storage system bulk loading
    ▪ Model building
    ▪ Reports
    ▪ Internal storage system workloads: replication, CRC checks, rebalancing, archiving, short stroking
  • 9. Management Challenges (Stuff I Didn’t Want to Worry About)
    ▪ Pricing: how much should I be paying for my hardware?
    ▪ Physical management: rack and stack, disk replacement, etc.
    ▪ Backup/restore, archive, capacity planning
    ▪ Optimal node configuration: CPU, memory, disk, network
    ▪ Optimal software configuration
    ▪ Geographic diversity for data availability and low latency
    ▪ Access control, encryption, and other security measures
    ▪ Tiered storage: separation of “hot” and “cold” data
  • 10. Okay, Let’s Get to the Cloud Stuff
    ▪ The cloud can help in removing management challenges
      ▪ Replicate highly valuable data into the cloud
      ▪ Archive cold data into the cloud
      ▪ Knit global data centers together with the cloud
    ▪ See “Watch for Goats in the Cloud” from David Slik of Bycast: http://tr.im/h9LK
  • 11. Cloud Challenges
    ▪ Current clouds are not optimized for data-intensive workloads
    ▪ Organizations own significant hardware assets
    ▪ Identity management
    ▪ Privacy and security
    ▪ Cloud seeding: moving data from the customer’s data center to the cloud
    ▪ Moving data within a mega-datacenter
    ▪ Moving data between clouds
  • 12. Bare Metal Cloud (Hosting?) Providers
    ▪ OpSource: integrated billing
    ▪ SoftLayer: data center API
    ▪ 3tera: “virtual private data center”
    ▪ GoGrid Cloud Connect
    ▪ Rackspace Platform Hosting
    ▪ The Planet
    ▪ Liquid Web
    ▪ Layered Tech
    ▪ Internap
    ▪ Terremark Enterprise Cloud
  • 13. Optimizing Hardware for DISC (We Need Less Power, Captain?)
    ▪ “FAWN: A Fast Array of Wimpy Nodes”: a DHT built at CMU with XScale chips and flash storage
    ▪ “Low Power Amdahl Blades for Data Intensive Computing”: couple low-power CPUs to flash SSDs for DISC workloads
    ▪ “Seeding the Clouds”, Dan Reed; also “Microsoft Builds Atomic Cloud”: Microsoft’s Cloud Computing Futures (CCF) team is exploring clusters built from nodes using low-power Atom chips
  • 14. Cloud Residue (What Happens to Existing Hardware?)
    ▪ Cloud pricing is not competitive when a company already owns excess server capacity and employs a significant operations team
    ▪ How can we speed the transition to the cloud?
      ▪ Consolidate the existing secondary market for hardware: purchase from companies with declining pageviews, e.g. MySpace
      ▪ Two birds, one stone: ship existing servers with the initial data load to the cloud provider (“cloud seeding”; see later slide)
      ▪ Wait it out: servers are generally considered to have a three-year lifespan
    ▪ Where do servers go when they die?
  • 15. Identity Across Clouds
    ▪ Configuring your LDAP server to speak to each new cloud utility is a pain
    ▪ Authentication and authorization systems are being built by every new cloud provider
    ▪ Every organization imposes different standards on cloud providers
    ▪ Consumer identity platforms:
      ▪ Facebook Connect
      ▪ OpenID + OAuth
    ▪ I don’t have a good answer here; any thoughts appreciated!
  • 16. Privacy and Security
    ▪ Every organization must reinvent and build expertise in these mechanisms
    ▪ Components:
      ▪ Physical security
      ▪ Cloud connection: authentication, authorization, encryption
      ▪ Audit logging
      ▪ Data obfuscation
      ▪ Separation from other customers in a multi-tenant environment
      ▪ Segregation of individual users within a customer’s cloud
      ▪ Storage retirement (disk shredding!)
      ▪ Controlling access of cloud provider employees
    ▪ Compliance, certification, and legislation
    ▪ Ramifications of a security breach
  • 17. Cloud Seeding (Let’s Get This Party Started)
    ▪ Freedom OSS offers an AWS-certified “Cloud Data Transfer Service”
      ▪ See http://www.freedomoss.com/clouddataingestion
    ▪ Bycast puts two or more “edge servers” on premise to perform the initial data ingestion, then ships those servers to their cloud data center
      ▪ See http://tr.im/h9PH
    ▪ If you can’t physically ship the disks, leverage Metro Ethernet or a dedicated link
    ▪ Investigate modified network stacks (see following slide)
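The ship-versus-wire decision behind cloud seeding is simple arithmetic. A sketch under illustrative assumptions (the link speeds, 80% utilization, and 3-day shipping estimate are all invented; only the ~2 PB data set size echoes an earlier slide):

```python
# Ship-vs-wire arithmetic for cloud seeding. All figures are illustrative
# assumptions: link classes, 80% sustained utilization, ~3 days of shipping.

def wire_days(dataset_tb, link_gbps, utilization=0.8):
    """Days to push `dataset_tb` terabytes over a link at the given utilization."""
    seconds = dataset_tb * 8e12 / (link_gbps * 1e9 * utilization)
    return seconds / 86400

dataset_tb = 2000          # ~2 PB, the Hadoop tier size from slide 6
for gbps in (1, 10, 100):  # Metro Ethernet / dedicated-link classes
    print(f"{gbps:>3} Gb/s: {wire_days(dataset_tb, gbps):8.1f} days on the wire")
print("shipping the disks: ~3 days, regardless of size")
```

Even a fully utilized 10 Gb/s link takes weeks for a petabyte-scale seed, which is why Bycast ships the edge servers instead.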
  • 18. Bulk Data Transfer Between Data Centers
    ▪ Companies:
      ▪ WAM!NET (bought by SAVVIS)
      ▪ Aspera Software
    ▪ Protocols:
      ▪ GridFTP
      ▪ UDT
    ▪ Unix utility: bbcp
    ▪ Modify congestion control
    ▪ WAN optimization tricks: compress, transfer deltas, cache, etc.
    ▪ Peering, transit, OC levels, all that good stuff
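The "transfer deltas" trick above can be sketched in miniature with fixed-size chunk hashing. This is a simplification of what rsync-style tools and WAN optimizers actually do (they use rolling checksums so insertions don't shift every chunk); the chunk size and hash choice here are illustrative:

```python
# Minimal delta-transfer sketch: only chunks whose hashes differ from the
# receiver's copy get sent. Fixed-size chunks and SHA-256 are simplifications;
# real tools (rsync, WAN optimizers) use rolling checksums and caching.
import hashlib

CHUNK = 4  # tiny chunk size for readability; real systems use KBs or MBs

def chunks(data: bytes):
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

def delta(old: bytes, new: bytes):
    """Return the (index, chunk) pairs the sender must transmit."""
    old_hashes = [hashlib.sha256(c).digest() for c in chunks(old)]
    out = []
    for i, c in enumerate(chunks(new)):
        if i >= len(old_hashes) or old_hashes[i] != hashlib.sha256(c).digest():
            out.append((i, c))
    return out

def apply_delta(old: bytes, d):
    cs = chunks(old)
    for i, c in d:
        if i < len(cs):
            cs[i] = c
        else:
            cs.append(c)
    return b"".join(cs)

old = b"the quick brown fox "
new = b"the quick green fox "
d = delta(old, new)
assert apply_delta(old, d) == new
print(f"sent {len(d)} of {len(chunks(new))} chunks")
```

Here only the changed chunks cross the wire; on a mostly stable multi-terabyte data set the savings dominate the hashing cost.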
  • 19. Data Transfer Within a Data Center
    ▪ Hierarchical topology: border routers, core switches, and top-of-rack switches
      ▪ Top-of-rack switches are usually oversubscribed
    ▪ Diversity of protocols: Ethernet, InfiniBand, Fibre Channel, PCI Express, etc.
    ▪ Networking companies are working to flatten the topology and unify protocols
      ▪ Cisco: Data Center Ethernet (DCE)
      ▪ Juniper: Stratus, a.k.a. Data Center Fabric (DCF)
    ▪ MapReduce is architected to push computation to the data; will such logic be necessary in the near future?
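That last bullet, pushing computation to the data, can be illustrated in miniature: run each map task on a node that already stores its input block, so only the small intermediate results cross the oversubscribed network. The node and block layout below is invented for illustration:

```python
# Toy illustration of MapReduce-style data locality: each map task runs on a
# node that already stores its input block, so only small Counter objects
# cross the network. The block placement below is invented for illustration.
from collections import Counter

# node -> blocks stored locally (what a NameNode-like service would know)
placement = {
    "node-a": ["cloud is big", "data is big"],
    "node-b": ["big data wins"],
}

def map_task(block):
    """Word count over one local block; output is tiny compared to the block."""
    return Counter(block.split())

def reduce_task(partials):
    total = Counter()
    for p in partials:
        total += p
    return total

# "Schedule" each map where its block lives; ship only the Counters.
partials = [map_task(b) for blocks in placement.values() for b in blocks]
print(reduce_task(partials))
```

The slide's open question is whether flattened, non-oversubscribed fabrics like DCE and DCF make this locality logic unnecessary.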
  • 20. Data Transfer Between Clouds
    ▪ Most cloud providers present novel APIs for data retrieval, e.g. S3, SimpleDB, the App Engine datastore, etc.
    ▪ It’s usually cheaper (or free) to transfer data within a cloud
    ▪ Standards and organizations are emerging:
      ▪ Open Virtualization Format (OVF)
      ▪ Open Cloud Consortium (OCC)
      ▪ Cloud Computing Interoperability Forum (CCIF)
      ▪ “Unified Cloud Interface” (UCI): their diagrams scare me, a little
  • 21. Service and Application Changes
    ▪ “Pay as you go” is the shared motto of Dataspaces and the Cloud; not a coincidence
    ▪ Persisting data into the information platform should be trivial
    ▪ Layer storage and processing capabilities onto the platform:
      ▪ Catalog
      ▪ Search
      ▪ Query
      ▪ Statistics and machine learning
    ▪ Materialize data into the storage system best suited to the workload
    ▪ Leverage workload metadata to get better over time
  • 22. Future Stages (Potential Evolutions, pt. 1)
    ▪ Global snapshots of the distributed file system
    ▪ Tiered storage to accommodate “cold” data
    ▪ Streaming computations over live data
    ▪ Higher-level libraries for text mining, linear algebra, etc.
    ▪ Tighter coupling between data collection, job scheduling, and reporting via a single metadata repository
    ▪ Testing and debugging frameworks
    ▪ Proliferation of data marts/sandboxes
    ▪ Accommodate compute-intensive workloads
  • 23. Future Stages (Potential Evolutions, pt. 2)
    ▪ Seamless collection of data sets from the web
    ▪ Wider variety of physical operators (cf. System R* through Dryad)
    ▪ Separate access APIs for different classes of users:
      ▪ Infrastructure engineers
      ▪ Product engineers
      ▪ Data scientists
      ▪ Business analysts
    ▪ DSLs for domain-specific work
    ▪ Utilize the browser as a client (AJAX, Comet, Gears, etc.)
  • 24. Future Stages (Potential Evolutions, pt. 3)
    ▪ Workflow cloning
    ▪ Recommended analyses based on workload and user metadata
    ▪ Automatic keyword search
    ▪ Integrity constraint checking and enforcement
    ▪ Granular access controls
    ▪ Metadata evolution history
    ▪ Table statistics and Hive query optimization
    ▪ Utilization optimization regularized by customer satisfaction
    ▪ Currency-based scheduling (cf. Thomas Sandholm’s work)
  • 25. Random Set of References (For a more complete bibliography, just ask)
    ▪ “The Cost of a Cloud”
    ▪ “Above the Clouds”
    ▪ “A Conversation with Jim Gray”
    ▪ “Rules of Thumb in Data Engineering”
    ▪ “Distributed Computing Economics”
    ▪ “From Databases to Dataspaces”
    ▪ Dryad and SPC papers
  • 26. (c) 2009 Cloudera, Inc. or its licensors. “Cloudera” is a registered trademark of Cloudera, Inc. All rights reserved. 1.0