SlideShare a Scribd company logo
1 of 50
Building Analytical Applications on PUBLICLY
                                 DO NOT USE
    Hadoop                        PRIOR TO 10/23/12
    Headline Goes Here
    Josh Wills | Director of Data Science
    Speaker Name or Subhead Goes Here
    November 2012




1
About Me




2
What are ‘Analytical Applications?’




3
The Humble Dashboard




4
Crossfilter with Flight Information




5
New York Times Electoral Vote Map




6
New York Times Electoral Vote Map (Detail)




7
Analytical Applications vs. Frameworks




8
Developing Analytical Applications
    A Case Study




9
2012: The Predicting of the President




10
RealClearPolitics

     • Simple Average of Polls


     • Transparent


     • Simple Interactions



11
FiveThirtyEight

                       • “Foxy” Model


                       • Opaque


                       • Simple Interactions with
                        a richer UI

12
Princeton Election Consortium

     • Medians and
      Polynomials

     • Transparent


     • Rich Interactions


13
How Did They Do?




14
A Few of These, Because They’re Fun




15
A Few of These, Because They’re Fun




16
A Few of These, Because They’re Fun




17
Here’s the Rub: One Expert Beat Nate




18
Index Funds, Hedge Funds, and Warren Buffett




19
A Brief Introduction to Hadoop




20
Data Storage in 2001: Databases

     • Structured schemas
     • Intensive processing
       done where data is
       stored
     • Somewhat reliable
     • Expensive at scale


21
Data Storage in 2001: Filers

                               • No schemas, stores any
                                 kind of file
                               • No data processing
                                 capability
                               • Reliable
                               • Expensive at scale


22
And Then, This Happened




23
Data Economics: Return on Byte




24
Big Data Economics

     • No individual record is
       particularly valuable
     • Having every record is
       incredibly valuable
         •   Web index
         •   Recommendation systems
         •   Sensor data
         •   Market basket analysis
         •   Online advertising

25
Introduction to Hadoop




26
The Hadoop Distributed File System

     • Based on the Google File
       System
     • Data stored in large files
        • Large block size: 64MB to
          256MB per block
        • Blocks are replicated to
          multiple nodes in the
          cluster

27
Simple, Reliable Processing: MapReduce
     •   Map Stage
          •   Embarrassingly parallel
     • Shuffle Stage: Large-scale distributed sort
     • Reduce Stage
          •   Process all of the values that have the same key in a single step
     • Process the data where it is stored
     • Write once and you’re done.



28
Developing Analytical Applications
     with Hadoop




29
Novelty is the Enemy of Adoption




30
The Best Way to Get Started: Apache Hive

     •   Apache Hive
          •   Data Warehouse System on
              top of Hadoop
     •   SQL-based query language
          • SELECT, INSERT, CREATE
            TABLE
          • Includes some MapReduce-
            specific extensions

31
Borrowing Abstractions




32
Improving the UX (http://github.com/cloudera/impala)




33
Moving Beyond the Abstractions




34
Making the Abstract Concrete




35
Cloudera’s Data Science Course




36
Analytical Applications I Love




37
The Experiments Dashboard




38
Adverse Drug Events




39
Gene Sequencing and Analytics




40
The Doctor’s Perspective




41
A Couple of Themes
     1.   Structure data the data in the way that makes sense for the
          problem.

     2.   Interactive inputs, not just interactive outputs.

     3.   Simpler interfaces that yield more sophisticated answers.



42
Working Towards The Dream




43
Developing Analytical Applications
     Moving Beyond MapReduce




44
The Cambrian Explosion…of Frameworks 




45
It’s Frameworks All The Way Down: Spark

     • Developed at Berkeley’s
       AMP Lab
     • Defines operations on
       distributed in-memory
       collections
     • Written in Scala
     • Supports reading to and
       writing from HDFS

46
IFATWD: Graphlab

     • Developed at CMU
     • Lower-level primitives
         •   (but higher than MPI)
     • Map/Reduce =>
       Update/Sort
     • Flexible, allows for
       asynchronous
       computations
     • Reads from HDFS

47
Playing with YARN




48
BranchReduce (http://github.com/cloudera/branchreduce)




49
50

More Related Content

What's hot

Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop clusterFurqan Haider
 
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler..."Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...Dataconomy Media
 
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...Dataconomy Media
 
QuantCell Research - The Big Data Spreadsheet
QuantCell Research - The Big Data SpreadsheetQuantCell Research - The Big Data Spreadsheet
QuantCell Research - The Big Data Spreadsheetinside-BigData.com
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataDataWorks Summit
 
Data mining tools used in business intelligence
Data mining tools used in business intelligenceData mining tools used in business intelligence
Data mining tools used in business intelligenceNithya Ravi
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scaledatamantra
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
The Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren ShureThe Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren ShureBig Data Spain
 
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc..."An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...Maya Lumbroso
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster ServicesAdam Doyle
 
Data pipelines from zero
Data pipelines from zero Data pipelines from zero
Data pipelines from zero Lars Albertsson
 
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...Databricks
 
Organising for Data Success
Organising for Data SuccessOrganising for Data Success
Organising for Data SuccessLars Albertsson
 
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Dataconomy Media
 

What's hot (20)

Big data technology
Big data technology Big data technology
Big data technology
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop cluster
 
From Big Data to Fast Data
From Big Data to Fast DataFrom Big Data to Fast Data
From Big Data to Fast Data
 
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler..."Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler...
 
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
 
QuantCell Research - The Big Data Spreadsheet
QuantCell Research - The Big Data SpreadsheetQuantCell Research - The Big Data Spreadsheet
QuantCell Research - The Big Data Spreadsheet
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
 
Data mining tools used in business intelligence
Data mining tools used in business intelligenceData mining tools used in business intelligence
Data mining tools used in business intelligence
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Real-Time Big Data
Real-Time Big DataReal-Time Big Data
Real-Time Big Data
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 
The Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren ShureThe Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren Shure
 
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc..."An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Data pipelines from zero
Data pipelines from zero Data pipelines from zero
Data pipelines from zero
 
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
 
Organising for Data Success
Organising for Data SuccessOrganising for Data Success
Organising for Data Success
 
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
 

Similar to Building Analytical Apps on Hadoop

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemZohar Elkayam
 
The Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersThe Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersZohar Elkayam
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Zohar Elkayam
 
Distributed data mining
Distributed data miningDistributed data mining
Distributed data miningAhmad Ammari
 
Data Science Day New York: Data Science: A Personal History
Data Science Day New York: Data Science: A Personal HistoryData Science Day New York: Data Science: A Personal History
Data Science Day New York: Data Science: A Personal HistoryCloudera, Inc.
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
Big data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalBig data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalramazan fırın
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageBethmi Gunasekara
 
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemThings Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemZohar Elkayam
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Lviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQLLviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQLzenyk
 
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527Zohar Elkayam
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Building Data Products
Building Data ProductsBuilding Data Products
Building Data ProductsCloudera, Inc.
 
Gilbane Boston 2012 Big Data 101
Gilbane Boston 2012 Big Data 101Gilbane Boston 2012 Big Data 101
Gilbane Boston 2012 Big Data 101Peter O'Kelly
 

Similar to Building Analytical Apps on Hadoop (20)

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
 
The Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersThe Hadoop Ecosystem for Developers
The Hadoop Ecosystem for Developers
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 
Distributed data mining
Distributed data miningDistributed data mining
Distributed data mining
 
Hadoop Eco system
Hadoop Eco systemHadoop Eco system
Hadoop Eco system
 
Data Science Day New York: Data Science: A Personal History
Data Science Day New York: Data Science: A Personal HistoryData Science Day New York: Data Science: A Personal History
Data Science Day New York: Data Science: A Personal History
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Big data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalBig data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-final
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data Storage
 
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemThings Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Lviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQLLviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQL
 
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Building Data Products
Building Data ProductsBuilding Data Products
Building Data Products
 
Big Data training
Big Data trainingBig Data training
Big Data training
 
Gilbane Boston 2012 Big Data 101
Gilbane Boston 2012 Big Data 101Gilbane Boston 2012 Big Data 101
Gilbane Boston 2012 Big Data 101
 

More from Dmitry Makarchuk

2012 11-28 rich web data modeling with graphs-1
2012 11-28 rich web data modeling with graphs-12012 11-28 rich web data modeling with graphs-1
2012 11-28 rich web data modeling with graphs-1Dmitry Makarchuk
 
2012 11-28 rich web data modeling with graphs-1
2012 11-28 rich web data modeling with graphs-12012 11-28 rich web data modeling with graphs-1
2012 11-28 rich web data modeling with graphs-1Dmitry Makarchuk
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
A random forest approach to skin detection with r
A random forest approach to skin detection with rA random forest approach to skin detection with r
A random forest approach to skin detection with rDmitry Makarchuk
 
"Your script just killed my site" by Steve Souders
"Your script just killed my site" by Steve Souders"Your script just killed my site" by Steve Souders
"Your script just killed my site" by Steve SoudersDmitry Makarchuk
 
RBrowserPlugin Project (Gabriel Becker)
RBrowserPlugin Project (Gabriel Becker)RBrowserPlugin Project (Gabriel Becker)
RBrowserPlugin Project (Gabriel Becker)Dmitry Makarchuk
 
Jesse Yates: Hbase snapshots patch
Jesse Yates: Hbase snapshots patchJesse Yates: Hbase snapshots patch
Jesse Yates: Hbase snapshots patchDmitry Makarchuk
 
Mongo DB in gaming industry
Mongo DB in gaming industryMongo DB in gaming industry
Mongo DB in gaming industryDmitry Makarchuk
 

More from Dmitry Makarchuk (11)

Linzer slides-barug
Linzer slides-barugLinzer slides-barug
Linzer slides-barug
 
2012 11-28 rich web data modeling with graphs-1
2012 11-28 rich web data modeling with graphs-12012 11-28 rich web data modeling with graphs-1
2012 11-28 rich web data modeling with graphs-1
 
2012 11-28 rich web data modeling with graphs-1
2012 11-28 rich web data modeling with graphs-12012 11-28 rich web data modeling with graphs-1
2012 11-28 rich web data modeling with graphs-1
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
A random forest approach to skin detection with r
A random forest approach to skin detection with rA random forest approach to skin detection with r
A random forest approach to skin detection with r
 
"Your script just killed my site" by Steve Souders
"Your script just killed my site" by Steve Souders"Your script just killed my site" by Steve Souders
"Your script just killed my site" by Steve Souders
 
RBrowserPlugin Project (Gabriel Becker)
RBrowserPlugin Project (Gabriel Becker)RBrowserPlugin Project (Gabriel Becker)
RBrowserPlugin Project (Gabriel Becker)
 
Bridge to r
Bridge to rBridge to r
Bridge to r
 
Jesse Yates: Hbase snapshots patch
Jesse Yates: Hbase snapshots patchJesse Yates: Hbase snapshots patch
Jesse Yates: Hbase snapshots patch
 
Phoenix h basemeetup
Phoenix h basemeetupPhoenix h basemeetup
Phoenix h basemeetup
 
Mongo DB in gaming industry
Mongo DB in gaming industryMongo DB in gaming industry
Mongo DB in gaming industry
 

Building Analytical Apps on Hadoop

Editor's Notes

  1. They are applications that allow users to work with and make decisions from data.
  2. It seems like there should be a UX equivalent of Clippy– maybe like a tiny picture of Edward Tufte– that pops up whenever someone decides to use a 3D pie chart.
  3. http://square.github.com/crossfilter/
  4. http://elections.nytimes.com/2012/results/president (Click on “Shift from 2008”)
  5. Click on a state to zoom in
  6. Frameworks != Analytical applicatons, for our purposes today. It’s not an analytical application until you put some data in it.
  7. A few different models were developed for predicting the presidency in 2012– let’s consider a few of them.
  8. http://www.realclearpolitics.com/epolls/2012/president/2012_elections_electoral_college_map.html
  9. http://fivethirtyeight.blogs.nytimes.com/
  10. http://election.princeton.edu/
  11. http://isnatesilverawitch.com/Everyone predicted the election correctly. The RCP model got every state but Florida, PEC said it was a tossup, and 538 got every single state right.
  12. MarkosMoulitsas over at theDailyKos did even better than Nate at predicting the share of the vote within the swing states. Don’t think that math can always out-perform an expert armed with good data.http://news.cnet.com/8301-13578_3-57546778-38/among-the-top-election-quants-nate-silver-reigns-supreme/
  13. Index fund == simple average.Hedge fund == 538Warren Buffett == Expert with good data
  14. Classical data economics: If the value I can extract from a byte is greater than the cost to store it, then I throw it away or store it on tape.
  15. We use metaphors that help us understand new technology in terms of the old. Translatedesktop tools and metaphors on to Hadoop, even when we’re working with specialized data types: http://blog.cloudera.com/blog/2012/01/seismic-data-science-hadoop-use-case/
  16. It’s a data warehousing metaphor– not an actual data warehouse. Schema on read vs. schema on write, for example. Non-interactive for the most part. Think of ELT, not interactive queries.
  17. We borrow these abstractions because they make it easy to get started, but they don’t necessarily conform to the user’s expectations of how Hadoop will work.If you think of Hadoop as a really big database, or as a spreadsheet that goes on forever and ever, then you have failed to understand Hadoop.
  18. Impala is about fulfilling those abstractions, esp. for interactive queries of relational-style data on Hadoop.
  19. But we can also go beyond the abstractions and study how Hadoop can be effective for new kinds of analytic applications.
  20. Step 1: Study real problems. Especially real problems where non-sophisticated users (e.g., people who don’t even know SQL) need to do sophisticated analysis on large quantities of information.
  21. I realized earlier this year that other people do not use Hive the way that I use Hive, and so we created the data science course to take people through the problem of building an analytical application from start to finish on Hadoop.http://blog.cloudera.com/blog/2012/10/data-science-training/
  22. They are applications that allow users to work with and make decisions from data.
  23. http://blog.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/
  24. http://www.slideshare.net/cloudera/7-leveraging-h-base-for-the-worlds-largest-curated-genomic-data-collection-satnam-alag-nextbio-finalupdatedlastminute
  25. The truth is that building tools for unsophisticated users typically requires incredibly sophisticated development.
  26. An open-source version of Wolfram Alpha for useful data.
  27. https://github.com/cloudera/kitten
  28. http://github.com/cloudera/branchreduce