The Berkeley AMPLab is developing a new open-source data analysis software stack that deeply integrates machine learning and data analytics at scale (Algorithms), cloud and cluster computing (Machines), and crowdsourcing (People) to make sense of massive data. Current application efforts focus on cancer genomics, real-time traffic prediction, and collaborative analytics for mobile devices. In this talk, we present an overview of this stack and demonstrate two key components: Spark and Shark.
BDT305 Transforming Big Data with Spark and Shark - AWS re:Invent 2012
It’s All Happening On-line
Every:
• Click
• Ad impression
• Billing event
• …
• Fast forward, pause, …
• Friend request
• Transaction
• Network message
• Fault
• …
User Generated (Web, Social & Mobile)
Internet of Things / M2M
Scientific Computing
Volume: Petabytes+
Variety: Unstructured
Velocity: Real-Time
Our view: More data should mean better answers
• Must balance Cost, Time, and Answer Quality
Organized for Collaboration:
Alex Bayen (Mobile Sensing)
Anthony Joseph (Sec./ Privacy)
Ken Goldberg (Crowdsourcing)
Randy Katz (Systems)
*Michael Franklin (Databases)
Dave Patterson (Systems)
Armando Fox (Systems)
*Ion Stoica (Systems)
*Mike Jordan (Machine Learning)
Scott Shenker (Networking)
• Sequencing costs (150X) Big Data
[Chart: $K per genome, 2001 - 2014; axis ticks from $0.1 to $100,000.0 in powers of ten]
• UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster @TCGA: 5 PB = 20 cancers x 1000 genomes
• See Dave Patterson’s Talk: Thursday 3-4, BDT205
David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011