
Anthony Joseph



We make sense of the world around us by turning data into information. For years, research in fields such as machine learning (ML), data mining, databases, information retrieval, natural language processing, and speech recognition has steadily improved techniques for revealing the information lying within otherwise opaque datasets. But computer science is now on the verge of a new era in data analysis because of several recent developments: the rise of the warehouse-scale computer, the massive explosion in online data, the increasing diversity and time-sensitivity of queries, and the advent of crowdsourcing. Together these trends, often referred to collectively as Big Data, have the potential to usher in a new era in data analysis, but realizing this opportunity requires us to confront several significant scientific challenges. This talk discusses some of these challenges in the context of academic and industrial research in the United States.

Published in: Technology, Education

  1. A Berkeley View of Big Data
     Anthony D. Joseph, UC Berkeley
     EDUSERV Symposium, 10 May 2012

     Who Am I?
     • Research:
       – Internet-scale systems (RAD Lab, AMP Lab)
       – Security (DETERlab Testbed)
       – Adversarial machine learning (SecML)
     • Teaching (undergrad/grad): operating systems and systems, security, networking
     • Disclaimer: I don’t speak for UC or our research sponsors
  2. Big Data is Massive…
     • Facebook:
       – 130 TB/day: user logs
       – 200–400 TB/day: 83 million pictures
       – >40 billion photos
     • Google: >25 PB/day of processed data
     • Data generated by the LHC: 1 PB/sec
     • Total data created in 2010: ~1 zettabyte (1,000,000 PB)
       – ~60% increase every year

     …and Diverse…
     • Walmart:
       – >1 million customer transactions/hour
       – >2.5 PB customer database
     • Human genome sequencing:
       – Analyzing 3 billion base pairs
       – Ten years for the first genome (2003)
       – Today, less than one week
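The ~60% annual growth figure above compounds quickly; a minimal sketch (the 2010 baseline and growth rate are the slide's figures, the projection horizon is purely illustrative):

```python
# Sketch: projecting total data volume under the ~60% annual growth
# rate cited on the slide. 1 ZB created in 2010 is the slide's figure;
# the projection itself is illustrative.

def projected_zettabytes(base_zb, annual_growth, years):
    """Compound the base volume forward by `years` at `annual_growth`."""
    return base_zb * (1 + annual_growth) ** years

for year in range(2010, 2016):
    zb = projected_zettabytes(1.0, 0.60, year - 2010)
    print(f"{year}: ~{zb:.1f} ZB")
```

At 60% per year the volume roughly decuples every five years, which is why the later slides worry that stored data outgrows both available storage and GB/$.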
  3. …and Novel…
     • Analyzing data from user behavior vs. user input
     • USGS TED: Twitter-based Earthquake Detector
     • Google Trends: “nowcasting”
       – US 2009 “Cash for Clunkers” program success
       – US state unemployment rates

     …and Grows Bigger…
     • More and more devices
     • More and more people
     • Cheaper and cheaper storage
       – ~50% increase in GB/$ every year
  4. …and Bigger!
     • Log everything!
       – You don’t always know what question you’ll need to answer
     • Stored data is growing faster than both available storage and GB/$

     Which Big Data to Keep?
     • Hard to decide what to delete
       – A thankless decision: people know only when you are wrong!
       – “Climate Research Unit (CRU) scientists admit they threw away key data used in global warming calculations”
  5. Data Retention Requirements
     • New NSF data retention requirements
       – Proposals submitted after 18 January 2011 must include a “Data Management Plan”
       – All data (including metadata) must be kept for 3 years after the research award concludes
     • Institutional/organizational considerations:
       – Opportunity to invest in pooled storage: campus, systemwide, regional, …
       – Typical cost: 8 TB chunks at $1.44/GB/year for collaborative space and $0.17/GB/year for archive space

     Big Data Isn’t Always Big
     • Data that is expensive to manage and hard to extract value from
     • You don’t need to be big to have a big data problem!
       – Inadequate tools to analyze data
       – Data management may dominate infrastructure cost
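The pooled-storage pricing quoted above turns into an annual budget with one multiplication; a small sketch, assuming decimal units (1 TB = 1,000 GB), which the slide does not specify:

```python
# Sketch: annual cost of the 8 TB pooled-storage chunks quoted on the
# slide, at $1.44/GB/year (collaborative space) vs $0.17/GB/year
# (archive space). Decimal TB->GB conversion is an assumption.

GB_PER_TB = 1000

def annual_cost(tb, dollars_per_gb_year):
    """Yearly storage cost for `tb` terabytes at the given $/GB/year rate."""
    return tb * GB_PER_TB * dollars_per_gb_year

collab = annual_cost(8, 1.44)   # collaborative space
archive = annual_cost(8, 0.17)  # archive space
print(f"collaborative: ${collab:,.0f}/year")  # $11,520/year
print(f"archive:       ${archive:,.0f}/year") # $1,360/year
```

The roughly 8x gap between the two tiers is what makes the "which data to keep, and where" decision of the previous slide an economic question, not just a scientific one.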
  6. Big Data is not Cheap!
     • Storing and managing 1 PB of data: $500K–$1M/year
       – Facebook: 200 PB/year
     • “Typical” cloud-based service startup (e.g., Conviva) with ~1 PB storage capacity
       – Log storage dominates infrastructure cost
     [Chart: infrastructure cost breakdown 2007–2010, storage cluster vs. other]

     Hard to Extract Value from Data!
     • Data is
       – Diverse, from a variety of sources
       – Uncurated: no schema, inconsistent semantics and syntax
       – Integration is a huge challenge
     • No easy way to get answers that are
       – High-quality
       – Timely
     • Challenge: maximize value from data by getting the best possible answers
  7. Requires a Multifaceted Approach
     • Three dimensions along which to improve data analysis:
       – Improving the scale, efficiency, and quality of algorithms (Algorithms)
       – Scaling up datacenters (Machines)
       – Leveraging human activity and intelligence (People)
     • Need to adaptively and flexibly combine all three dimensions

     The State of the Art
     • Today’s apps occupy fixed points in the Algorithms/Machines/People solution space (e.g., Watson/IBM, web search)
     • Need techniques to dynamically pick the best operating point
  8. What Is the Big Data Problem?
     • Two main reasons:
       – The more data, the greater the chance of finding any pattern you’d like to find
         • The more rows in a table, the more columns
         • The more columns, the more hypotheses that can be considered
         • Indeed, the number of hypotheses grows exponentially in the number of columns
       – The more data, the less likely a sophisticated ML algorithm will run in an acceptable time frame
         • And then we have to back off to cheaper algorithms that may be more error-prone

     A Formulation of the Problem
     • Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound)
       – This is far from being achieved in the current state of the literature!
     • It can be achieved by building a scalable system that blends statistical and computational design principles
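The first point, that more columns mean more hypotheses and therefore more patterns found by pure chance, can be illustrated with synthetic data. This demo is mine, not from the talk: every column and the target are independent noise, yet the count of "significant" correlations grows with the number of columns.

```python
# Sketch (illustrative): random columns tested against a random target.
# None of the "patterns" found are real; their count grows with the
# number of columns tested, as the slide warns.
import random

def spurious_hits(n_rows, n_cols, cutoff=0.5, seed=0):
    """Count random columns whose |Pearson r| with a random target exceeds cutoff."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    my = sum(target) / n_rows
    hits = 0
    for _ in range(n_cols):
        col = [rng.gauss(0, 1) for _ in range(n_rows)]
        mx = sum(col) / n_rows
        cov = sum((x - mx) * (y - my) for x, y in zip(col, target))
        vx = sum((x - mx) ** 2 for x in col)
        vy = sum((y - my) ** 2 for y in target)
        r = cov / (vx * vy) ** 0.5
        if abs(r) > cutoff:
            hits += 1
    return hits

print("10 columns:  ", spurious_hits(20, 10))
print("1000 columns:", spurious_hits(20, 1000))
```

With only 20 rows and 1,000 candidate columns, dozens of columns typically clear the cutoff by luck alone, which is exactly why naive pattern-hunting at scale needs the statistical guarantees the slide calls for.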
  9. Big Data in the US
     • Many Fortune 1000+ companies have huge write-once, read-none big data collections
       – For all the reasons I’ve already outlined…
     • US government agencies are in the same situation
       – New R&D funding
     • Many companies developing proprietary solutions
     • Very active open source big data tools community
       – Broad international participation
       – Data Without Borders helping non-profits through pro bono data collection, analysis, and visualization

     Significant USG Investment
     • 29 March 2012: US federal agencies announced more than $200 million in new commitments
       – Dept of Defense, Dept of Homeland Security, Dept of Energy, Veterans Administration, Office of Scientific and Technical Information, Health and Human Services, Food and Drug Administration, National Archives & Records Administration, National Aeronautics & Space Administration, National Institutes of Health, National Science Foundation, National Security Agency, US Geological Survey
  10. Active Open Source Community
      • Ongoing development of several elements of the Big Data analysis pipeline:
        – Apache Hadoop (MapReduce)
        – Hive
        – Apache Pig
        – R / Octave
      • Much more is needed!
        – E.g., new analysis environments

      The AMP Lab
      • Make sense of data at scale by tightly integrating algorithms, machines, and people
      [Diagram: Algorithms/Machines/People triangle, with Watson/IBM and search as example points]
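For readers unfamiliar with the MapReduce model that Apache Hadoop implements, here is its dataflow reduced to a single-process word count. This is a sketch only: real Hadoop jobs distribute the map and reduce phases across a cluster and shuffle intermediate pairs by key in between.

```python
# Sketch: the MapReduce dataflow behind Apache Hadoop, in one process.
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs, as a Hadoop mapper would."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Group pairs by key and sum the values, as a Hadoop reducer would."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["Big Data is massive", "big data is diverse"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'is': 2, 'massive': 1, 'diverse': 1}
```

Hive and Pig, listed above, compile SQL-like and scripting-style queries down to exactly this kind of map and reduce pipeline.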
  11. AMP Faculty and Sponsors
      • Faculty
        – Alex Bayen (mobile sensing platforms)
        – Armando Fox (systems)
        – Michael Franklin (databases): Director
        – Michael Jordan (machine learning): Co-director
        – Anthony Joseph (security & privacy)
        – Randy Katz (systems)
        – David Patterson (systems)
        – Ion Stoica (systems): Co-director
        – Scott Shenker (networking)
      • Sponsors: [logos]

      Algorithms
      • State-of-the-art machine learning (ML) algorithms do not scale
        – Prohibitive to process all data points
        – How do you know when to stop?
      [Plot: estimate converging toward the true answer as the number of data points grows]
  12. Algorithms
      • Given any problem, data, and a budget:
        – Immediate results with continuous improvement
        – Calibrated answers: provide error bars
      [Plot: estimate vs. number of data points, with error bars on every answer]

      Algorithms
      • Given any problem, data, and a time budget:
        – Immediate results with continuous improvement
        – Calibrated answers: provide error bars
        – Stop when the error is smaller than a given threshold
      [Plot: estimate vs. time, error bars shrinking below the threshold]
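The "error bars plus stopping rule" idea on this slide can be sketched concretely: estimate a mean from a stream of data points, report a 95% confidence half-width as the error bar, and stop once it drops below a threshold. The synthetic stream, the threshold, and the normal-approximation interval are all illustrative assumptions here, not the AMP Lab's actual algorithms.

```python
# Sketch (illustrative): incremental estimation with an error bar and a
# stopping rule, so the answer improves continuously and we quit early.
import math
import random

def estimate_until(stream, threshold, min_samples=30):
    """Consume points until the ~95% CI half-width falls below threshold."""
    n, total, total_sq = 0, 0.0, 0.0
    for x in stream:
        n += 1
        total += x
        total_sq += x * x
        if n >= min_samples:
            mean = total / n
            var = (total_sq - n * mean * mean) / (n - 1)  # sample variance
            half_width = 1.96 * math.sqrt(var / n)        # normal approx.
            if half_width < threshold:
                return mean, half_width, n
    return total / n, float("inf"), n

rng = random.Random(42)
stream = (rng.gauss(10.0, 2.0) for _ in range(1_000_000))
mean, err, used = estimate_until(stream, threshold=0.05)
print(f"mean ~ {mean:.3f} +/- {err:.3f} after {used} of 1,000,000 points")
```

The point of the slide is visible in the numbers: only a few thousand of the million available points are needed once the error bar, rather than the dataset size, decides when to stop.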
  13. Algorithms
      • Given any problem, data, and a time budget:
        – Automatically pick the best algorithm
      [Plot: error vs. time for simple and sophisticated algorithms; pick the simple one while the sophisticated one’s error is still too high, switch to the sophisticated one once it is not]

      Machines
      • “The datacenter as a computer” is still in its infancy
        – Special-purpose clusters, e.g., Hadoop clusters
        – Highly variable performance
        – Hard to program
        – Hard to debug
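Automatically picking the best algorithm for a time budget can be caricatured as a lookup over cost/quality estimates. The algorithm names, runtimes, and error figures below are invented for illustration; the slide proposes something far more adaptive.

```python
# Sketch (hypothetical figures): choose the lowest-error algorithm whose
# estimated runtime fits the time budget, falling back to simpler,
# more error-prone ones when it does not.

ALGORITHMS = [
    # (name, est. seconds per 1M points, est. relative error)
    ("sophisticated", 120.0, 0.01),
    ("medium",         12.0, 0.05),
    ("simple",          1.0, 0.20),
]

def pick_algorithm(n_points, time_budget_s):
    """Return the name of the lowest-error algorithm that fits the budget."""
    feasible = [
        (name, err) for name, secs, err in ALGORITHMS
        if secs * (n_points / 1e6) <= time_budget_s
    ]
    if not feasible:
        raise ValueError("no algorithm fits the budget; subsample the data")
    return min(feasible, key=lambda t: t[1])[0]

print(pick_algorithm(n_points=1_000_000, time_budget_s=15))    # "medium" fits
print(pick_algorithm(n_points=50_000_000, time_budget_s=100))  # only "simple" fits
```

Combined with the error bars of the previous slide, a system like this could also report how much answer quality was traded away to meet the deadline.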
  14. Machines
      • Make the datacenter a real computer!
      • Share the datacenter between multiple cluster computing apps
      • Provide new abstractions and services
      [Diagram: AMP stack, with a datacenter “OS” (e.g., Mesos) running over per-node OSes (e.g., Linux, Windows)]

      Machines
      • Make the datacenter a real computer!
      • Support existing cluster computing apps: Hadoop, MPI, Hive, Cassandra, Hypertable, …
      [Diagram: existing apps running on the datacenter “OS” (e.g., Mesos) over per-node OSes]
  15. Machines
      • Make the datacenter a real computer!
      • Support interactive and iterative data analysis (e.g., ML algorithms): Spark
      • Predictive and insightful query language: PIQL
      • Consistency-adjustable data store: SCADS
      [Diagram: AMP stack (Spark, PIQL, SCADS) alongside existing apps (Hadoop, MPI, Hive, Cassandra, Hypertable) on the datacenter “OS” (e.g., Mesos)]

      Machines
      • Make the datacenter a real computer!
      • Applications and tools:
        – Advanced ML algorithms
        – Interactive data mining
        – Collaborative visualization
  16. People
      • Humans can make sense of messy data!

      People
      • Make people an integrated part of the system!
        – Leverage human activity
        – Leverage human intelligence (crowdsourcing):
          • Curate and clean dirty data
          • Answer imprecise questions
          • Test and improve algorithms
      • Challenge
        – Inconsistent answer quality in all dimensions (e.g., type of question, time, cost)
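One standard response to the inconsistent-answer-quality challenge is to weight each crowd worker's vote by an estimate of their accuracy. A minimal sketch; the workers, weights, and the 0.5 default for unknown workers are all hypothetical, not part of the talk:

```python
# Sketch (illustrative): weighted majority vote over crowd answers,
# so one reliable worker can outvote several unreliable ones.
from collections import defaultdict

def weighted_majority(answers, worker_accuracy):
    """answers: list of (worker, answer) pairs; returns the winning answer."""
    scores = defaultdict(float)
    for worker, answer in answers:
        # Unknown workers get a neutral 0.5 weight (an assumption).
        scores[answer] += worker_accuracy.get(worker, 0.5)
    return max(scores, key=scores.get)

accuracy = {"alice": 0.95, "bob": 0.51, "carol": 0.40}
answers = [("alice", "cat"), ("bob", "dog"), ("carol", "dog")]
# Plain majority says "dog" (2 votes to 1); weighting by accuracy
# (0.95 vs 0.51 + 0.40 = 0.91) flips the result to alice's "cat".
print(weighted_majority(answers, accuracy))
```

Estimating the per-worker accuracies themselves, across question types, time, and cost, is precisely the open problem the slide flags.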
  17. Real Applications
      • Mobile Millennium Project
        – Alex Bayen, Civil and Environmental Engineering, UC Berkeley
      • Microsimulation of urban development
        – Paul Waddell, College of Environmental Design, UC Berkeley
      • Crowd-based opinion formation
        – Ken Goldberg, Industrial Engineering and Operations Research, UC Berkeley
      • Personalized sequencing
        – Taylor Sittler, UCSF

      Personalized Sequencing
      [Slide of figures]
  18. The AMP Lab
      • Make sense of data at scale by tightly integrating algorithms, machines, and people
      [Diagram: applications (Microsimulation, Mobile Millennium, Sequencing) placed in the Algorithms/Machines/People space]

      Big Data in 2020: Are You Prepared?
      • To create a new generation of big data scientists
      • For ML to become an engineering discipline
      • For people to be deeply integrated in the big data analysis pipeline
      • Will your institution
        – offer a big data curriculum touching all fields?
        – have hired cross-disciplinary faculty?
        – have invested in (pooled) storage infrastructure?
        – have invested in public/private clouds?
        – have built inter/intra-campus networks?
  19. Summary
      • Goal: tame the Big Data problem
        – Get results of the right quality at the right time
      • Approach: holistic integration of Algorithms, Machines, and People
      • Huge research issues across many domains