• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
20120223keystone
 

20120223keystone

on

  • 1,467 views

 

Statistics

Views

Total Views
1,467
Views on SlideShare
1,458
Embed Views
9

Actions

Likes
6
Downloads
17
Comments
0

1 Embed 9

http://www.linkedin.com 9

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    20120223keystone 20120223keystone Presentation Transcript

    • Thursday, February 23, 12
    • Evolving an Analytical Data Platform With Applications to Medical Data Jeff Hammerbacher Chief Scientist, Cloudera February 23, 2012Thursday, February 23, 12
    • Presentation Outline ▪ 0. Context ▪ 1. Philosophy ▪ 2. Platform ▪ 3. ApplicationsThursday, February 23, 12
    • 0. ContextThursday, February 23, 12
    • Context About me ▪ Mathematics at Harvard ▪ Quant at Bear Stearns ▪ Manager, Data at Facebook ▪ Founder and Chief Scientist at Cloudera ▪ Director at Sage Foundation ▪ Teach “Introduction to Data Science” at BerkeleyThursday, February 23, 12
    • Context About Cloudera ▪ Founded in 2008 ▪ Headquarters in Palo Alto ▪ 185 employees ▪ Software ▪ CDH ▪ Cloudera Manager ▪ Training ▪ Cloudera UniversityThursday, February 23, 12
    • Context What I care about 1) Open source software for data management and analysis 2) Teaching the world to use this software effectively 3) Using this software to effect positive change in the worldThursday, February 23, 12
    • 1. PhilosophyThursday, February 23, 12
    • Philosophy ▪ The true challenges in the task of data mining ▪ Creating a data set with the relevant and accurate information ▪ Determining the appropriate analysis techniques Adapted from “Exploratory Data Mining and Data Cleaning” by Tamraparni Dasu and Ted JohnsonThursday, February 23, 12
    • Philosophy Creating a data set ▪ Store all of your data in one place ▪ Data first, questions later ▪ Store first, structure later ▪ Keep raw data foreverThursday, February 23, 12
    • Philosophy Choosing an analysis technique ▪ Enable everyone to party on the data ▪ Developers ▪ Analysts ▪ Business usersThursday, February 23, 12
    • Philosophy ▪ We have to produce tools to support the whole research cycle ▪ data capture ▪ data curation ▪ data analysis ▪ data visualization Adapted from “The Fourth Paradigm” by Jim GrayThursday, February 23, 12
    • Application Requests Application Data Database Warehouse ETL Business Analytics IntelligenceThursday, February 23, 12
    • Application Requests Application Data Database Hadoop + Hive Warehouse Business Analytics Intelligence Business Analytics IntelligenceThursday, February 23, 12
    • 2. PlatformThursday, February 23, 12
    • Platform Substrate ▪ Commodity servers ▪ Open Compute ▪ Open source operating system ▪ Linux ▪ Open source configuration management ▪ Puppet, Chef ▪ Coordination service ▪ ZooKeeperThursday, February 23, 12
    • Platform Storage ▪ Distributed schema-less storage ▪ HDFS ▪ Append-only table storage and metadata ▪ Hive ▪ Mutable table storage and metadata ▪ HBaseThursday, February 23, 12
    • Platform Compute ▪ Cluster resource management ▪ YARN ▪ Processing frameworks ▪ MapReduce, MPI ▪ High-level interfaces ▪ Crunch, PigLatin, HiveQL, Oozie ▪ Libraries ▪ DataFu, MahoutThursday, February 23, 12
    • Platform Integration ▪ Data access ▪ FUSE ▪ ODBC/JDBC ▪ Data ingest ▪ Sqoop ▪ Flume ▪ User interface ▪ HueThursday, February 23, 12
    • 3. ApplicationsThursday, February 23, 12
    • Applications FDA ▪ Phase IV/post-market analysis of drug safety ▪ Find unsuspected adverse drug events (ADEs) ▪ Adverse Event Reporting System (AERS) data is available online ▪ Used Pig to identify novel 3-drug combinations ▪ No complex algorithms requiredThursday, February 23, 12
    • Applications HIV Drug InteractionsThursday, February 23, 12
    • Applications Michael Schatz ▪ Contrail: de novo assembly of large genomes from short reads ▪ CloudBurst: parallel read mapping ▪ Crossbow: find SNPs from short read data ▪ Genome indexing: suffix array, BWT ▪ Work done at Maryland and Cold Spring HarborThursday, February 23, 12
    • Applications SeqWare Query Engine ▪ Load and query variants over thousands of genomes ▪ Handles a variety of variants and annotations ▪ Proof of concept using the U87MG genome ▪ Runs on HBase ▪ Open source ▪ Work done at UCLAThursday, February 23, 12
    • Applications Nephele ▪ Genotyping without multiple sequence alignment ▪ Represent sequence with complete composition vector ▪ Use affinity propagation clustering to group sequences ▪ Code is open source ▪ Work done at MITREThursday, February 23, 12
    • Applications Hadoop-GIS ▪ High performance queries for analytical pathology imaging ▪ Spatial query engine RESQUE ▪ Augments Hive with spatial query capabilities ▪ Will support analytical pathology imaging guided diagnosis ▪ Work done at Emory UniversityThursday, February 23, 12
    • Applications Explorys ▪ “Medical informatics platform” ▪ Search and analyze ▪ patient populations ▪ treatment protocols ▪ clinical outcomes ▪ Explorys engineer Doug Meil is an HBase committerThursday, February 23, 12
    • Applications NextBio ▪ “Integrative Genomics” ▪ Platform for integrating public and private information ▪ Literature search, automated annotation ▪ Sequence-specific data management components ▪ Pipeline powered by HadoopThursday, February 23, 12
    • Applications IBM Watson ▪ Automated diagnosisThursday, February 23, 12
    • Applications Microsoft Research ▪ Cyberchondria ▪ Understanding how web content is navigated ▪ Uses search logs for analysisThursday, February 23, 12
    • (c) 2012 Cloudera, Inc. or its licensors.  "Cloudera" is a registered trademark of Cloudera, Inc.. All rights reserved. 1.0Thursday, February 23, 12
    • Why you might not care ▪ Long-lived organizations with multiple departments ▪ Data sources primarily internal to the organization ▪ Reporting and ad hoc query workloads as important as analysis ▪ CDH strengths ▪ data capture ▪ data curation ▪ CDH weaknesses ▪ interactive query performance ▪ model fitting (optimization) ▪ linear algebra (arrays are not a primitive type)Thursday, February 23, 12