Monday, March 1, 2010
Open Questions for Building an Enterprise Data Platform on the Cloud

Jeff Hammerbacher
20100301icde

  1. Monday, March 1, 2010
  2. Open Questions for Building an Enterprise Data Platform on the Cloud
     Jeff Hammerbacher
     Chief Scientist and Vice President of Products, Cloudera
     March 1, 2010
  3. Presentation Outline
     ▪ Who am I and what am I talking about?
       ▪ My Background
       ▪ Open Questions
       ▪ Data Platforms
       ▪ The Cloud
     ▪ Research Challenges
       ▪ Infrastructure
       ▪ Interface
       ▪ Migration
     ▪ Build something!
  4. My Background: Thanks for Asking
     ▪ hammer@cloudera.com
     ▪ Studied Mathematics at Harvard
     ▪ Worked as a Quant on Wall Street
     ▪ Conceived, built, and led Data team at Facebook
       ▪ Nearly 30 amazing engineers and data scientists
       ▪ Several open source projects and research papers
     ▪ Founder of Cloudera
       ▪ Vice President of Products and Chief Scientist
     ▪ Also, check out the book “Beautiful Data”
  5. Open Questions: Some Context
     ▪ I don’t have a PhD
     ▪ In fact, I don’t have a publication history
     ▪ But I read a lot?
     ▪ Have deployed (and sometimes built) several distributed systems
       ▪ Oracle RAC
       ▪ Hadoop + Hive
       ▪ Cassandra
       ▪ New things at Cloudera
     ▪ Sort of like the Cubs GM asking a Cubs fan for advice
  6. Data Platforms: Circumscribing our Focus
     ▪ Primarily concerned with infrastructure for analytics
     ▪ To borrow a phrase from Ralph Kimball
       ▪ Operational systems “turn the wheels”
       ▪ Analytical systems “watch the wheels turn”
     ▪ Reference architecture
       ▪ ETL/Data Integration
       ▪ DW
       ▪ BI
       ▪ Complex Analytics
  7. Data Platforms: Another Perspective
     ▪ Analytical infrastructure as a platform
       ▪ Infrastructure providers
         ▪ Hardware and systems software
       ▪ Platform providers
         ▪ Suite of software tools to collect, store, manage, and analyze data
       ▪ Content providers
       ▪ Application developers
       ▪ End users
  8. The Cloud: Some Terminology
     ▪ Layers of providers (looks familiar)
       ▪ Infrastructure as a Service (IaaS)
       ▪ Platform as a Service (PaaS)
       ▪ Software as a Service (SaaS)
     ▪ Where is it deployed?
       ▪ Public cloud
       ▪ Private cloud
       ▪ Hybrid cloud
  9. The Cloud: Current State
     ▪ Many infrastructure and software providers
       ▪ Rackspace, Terremark, SoftLayer, and friends in infrastructure
       ▪ Salesforce and Workday in traditional enterprise applications
       ▪ SnapLogic, Cast Iron Systems in ETL
       ▪ Kognitio in DW
       ▪ LucidEra, PivotLink, Quantivo, and friends in BI
     ▪ Less developed PaaS market for analytics
       ▪ RightScale + Talend + Vertica + Jaspersoft partnership
  10. Research Challenges: Problem Statement
      What are the research challenges we’ll encounter moving from today’s architectures for enterprise analytics to an integrated platform-as-a-service model built on public, private, or hybrid cloud infrastructure?
  11. Research Challenges: Infrastructure
      ▪ Server and data center design
        ▪ Servers for WSCs project at Michigan
        ▪ FAWN at CMU: low-power CPU and SSD for storage
        ▪ Making use of multi-core and GPUs
        ▪ Power management projects all over
        ▪ Data center design projects
          ▪ Evolution of containers
          ▪ Yahoo!’s “chicken coop”
        ▪ OpenFlow, Vyatta, Arista, and Nicira in networking
  12. Research Challenges: Infrastructure
      ▪ How to achieve isolation while maintaining performance?
        ▪ Failure isolation
        ▪ Performance isolation
        ▪ Security isolation
      ▪ Many interesting projects
        ▪ Process Groups/Containers: Solaris Zones, LXC, Job Objects
        ▪ Lowered VM startup time via cloning: SnowFlock
        ▪ Data locality for VM scheduling: Tashi
        ▪ Resource management for grids: Nexus
  13. Research Challenges: Infrastructure
      ▪ Configuration Management
        ▪ Lots of work in industry: cfengine, bcfg2, Puppet, Chef
        ▪ Not a lot of research on the topic!
      ▪ Scheduling
        ▪ Benchmarks for concurrent queries and almost-full systems
        ▪ Hybrid cloud (“cloudbursting”) scheduling
        ▪ Scheduling in the presence of variable performance
        ▪ Continuous version of fault tolerance?
  14. Research Challenges: Infrastructure
      ▪ Bulk data transfer
        ▪ Moving data over the WAN is scary
        ▪ Aspera, FastSoft, WAM!NET built companies out of this research
        ▪ UDT proposed as a protocol from Chicago
        ▪ Incremental progress indicators and restart would be nice
      ▪ Latency-sensitive requests
        ▪ Lower variability: better DNS?
        ▪ Lower latency: SPDY?
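The wish for "incremental progress indicators and restart" in bulk transfer can be illustrated with a small sketch. This is not taken from the talk or from any of the products named on the slide: `resumable_copy` and `on_progress` are hypothetical names, a local file copy stands in for a WAN transfer, and the code assumes that whatever bytes already sit at the destination are a valid prefix of the source (a real tool would verify this with checksums).

```python
import os


def resumable_copy(src_path, dst_path, chunk_size=1 << 20, on_progress=None):
    """Copy src to dst in chunks, resuming from the bytes dst already holds.

    Hypothetical helper: assumes the existing destination bytes are a
    correct prefix of the source (no checksum verification is done).
    """
    total = os.path.getsize(src_path)
    done = os.path.getsize(dst_path) if os.path.exists(dst_path) else 0
    if done > total:
        done = 0  # destination is longer than the source: start over

    # "r+b" keeps the existing prefix; "wb" creates or truncates the file.
    mode = "r+b" if done else "wb"
    with open(src_path, "rb") as src, open(dst_path, mode) as dst:
        src.seek(done)
        dst.seek(done)
        while done < total:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            done += len(chunk)
            if on_progress:
                on_progress(done, total)  # incremental progress report
    return done
```

A caller might pass `on_progress=lambda done, total: print(f"{done}/{total} bytes")` and simply re-run the same call after an interruption to continue where the copy stopped.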
  15. Research Challenges: Interface
      ▪ Application Developers
        ▪ Incremental query progress visualization
        ▪ Run time simulation and prediction
        ▪ ILLUSTRATE command for sample tuple generation
        ▪ Compile-time rather than run-time checking
        ▪ Libraries of basic operations which present higher-order APIs
        ▪ Performance optimization suggestions
        ▪ Distributed debugging utilities
  16. Research Challenges: Interface
      ▪ New data models: when to use them and how do they interact?
        ▪ Multi-dimensional hash maps with locality groups: BigTable, HBase
        ▪ Documents: CouchDB, MongoDB, Riak (MarkLogic?)
        ▪ Arrays: SciDB
        ▪ Graphs: SHS
        ▪ Trajectories: TrajStore
      ▪ Cross-language serialization and RPC frameworks
        ▪ ASN.1, XDR, CORBA, ICE, Thrift, Etch, PBs, DataSeries, Avro
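To make "multi-dimensional hash maps with locality groups" concrete, here is a toy in-memory sketch of that data model: values are addressed by (row, column, timestamp), columns belong to families, and families are assigned to locality groups that are stored separately. `TinyTable` and its methods are hypothetical names for illustration only. This is not the BigTable or HBase API, and it ignores persistence, distribution, and sorted on-disk layout.

```python
from collections import defaultdict


class TinyTable:
    """Toy (row, "family:qualifier", timestamp) -> value map.

    Hypothetical illustration: each locality group gets its own store,
    so scanning one group never touches another group's data.
    """

    def __init__(self, locality_groups):
        # locality_groups maps a group name to the column families it holds.
        self._group_of = {
            family: group
            for group, families in locality_groups.items()
            for family in families
        }
        # group -> row -> column -> {timestamp: value}
        self._stores = {
            group: defaultdict(lambda: defaultdict(dict))
            for group in locality_groups
        }

    def _store_for(self, column):
        family = column.split(":", 1)[0]
        return self._stores[self._group_of[family]]

    def put(self, row, column, timestamp, value):
        self._store_for(column)[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the value at a timestamp, or the newest version if None."""
        versions = self._store_for(column).get(row, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan_group(self, group):
        """Yield (row, columns) in row-key order for one locality group."""
        store = self._stores[group]
        for row in sorted(store):
            yield row, dict(store[row])
```

For example, a table built with `TinyTable({"meta": ["anchor"], "content": ["contents"]})` keeps small `anchor:` cells apart from large `contents:` cells, so a scan of the `meta` group does not read page bodies.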
  17. Research Challenges: Interface
      ▪ Query languages
        ▪ Programmer time-to-learn and productivity analysis for:
          ▪ Various MapReduce implementations
          ▪ Sawzall, PigLatin, SCOPE, Hive, DryadLINQ, ScalaQL
          ▪ Existing stuff: PL/SQL, TSQL, SQL*Loader, XQuery, XPath, etc.?
          ▪ Languages for analytics: R, S, SAS, SPSS, Matlab
        ▪ Can these all target a single execution layer?
        ▪ Should we be embedding our queries in a host language?
          ▪ LINQ, ScalaQL, Ferry
  18. Research Challenges: Interface
      ▪ Collaborative analytics
        ▪ User profiles, news feed, message inboxes, recommendations
      ▪ Improve the browser
        ▪ Interactive visualization libraries in JavaScript
        ▪ What does HTML5 mean for the data analyst?
      ▪ How can we leverage multi-touch interfaces?
      ▪ What do new mobile devices mean for data analysts?
        ▪ Netbooks, iPhone, Android phones, Kindle, Nook, etc.
  19. Research Challenges: Migration
      ▪ How do we get there from here?
        ▪ Workload analysis to identify what can be moved to PaaS first
        ▪ Ethnographic studies of what’s hard for data analysts today
        ▪ Privacy and security considerations
        ▪ Integration with third-party data sources
        ▪ Retention policies
        ▪ Cloud interoperability!
        ▪ Tools to prototype locally and deploy to platform later
        ▪ New university courses to build these skills
  20. Research Challenges: Build Something!
      ▪ “A man who carries a cat by the tail...”
      ▪ Participate in an open source community
      ▪ Build a website and make the data available (e.g. MovieLens)
      ▪ Experience the joys of
        ▪ installation
        ▪ configuration
        ▪ deployment
        ▪ monitoring
        ▪ performance tuning, debugging, upgrades, and more!
  21. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0
