Hadoop vs. RDBMS for Advanced Analytics

Published in: Technology, Education
  • How do you know you have a unit of analysis problem? You’re doing a bunch of COUNT DISTINCT queries. You’re doing LAG/LEAD-style queries, or using a cursor.
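The symptoms in the note above can be sketched in a few lines of Python. This is only an illustration, not something from the deck: the `events` rows, the field layout, and the helper `gaps_since_previous` are all hypothetical. The point is that the rows are events while the questions are about visitors, so each question needs a distinct count or a look back at the previous row.

```python
# Hypothetical event-level rows: (visitor_id, timestamp_seconds, page).
# The data and names here are assumptions made up for illustration.
events = [
    ("v1", 10, "home"),
    ("v2", 12, "home"),
    ("v1", 15, "pricing"),
    ("v1", 40, "signup"),
    ("v2", 50, "pricing"),
    ("v3", 55, "home"),
]

# Equivalent of COUNT(DISTINCT visitor_id) WHERE page = 'pricing':
# the rows are events, but the question is about visitors.
pricing_visitors = len({visitor for visitor, _, page in events if page == "pricing"})


def gaps_since_previous(rows):
    """LAG-style lookup: seconds since the same visitor's previous event."""
    last_seen = {}
    gaps = []
    # Sort by visitor, then time, like PARTITION BY visitor ORDER BY timestamp.
    for visitor, ts, page in sorted(rows, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(visitor)
        gaps.append((visitor, page, None if prev is None else ts - prev))
        last_seen[visitor] = ts
    return gaps


print(pricing_visitors)
print(gaps_since_previous(events))
```

Both computations have to reassemble the visitor from individual event rows before they can answer anything, which is the mismatch the note describes.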

    1. Hadoop vs. RDBMS for Advanced Analytics
       Josh Wills, April 26th, 2012
    2. About Me
       • jwills@cloudera.com
       • Formerly of Google (2008 – 2011)
         • Worked on the ad auction
         • Led the team that built the data infrastructure for Google+
       • Before that: a bunch of startups
         • Sometimes as a software engineer, sometimes as a statistician
       • Math degree from Duke and a half-finished PhD from The University of Texas at Austin
       • Now: Director of Data Science at Cloudera
       Copyright 2012 Cloudera Inc. All rights reserved
    3. Getting Started with Hadoop: Apache Hive
       • Stick with the relational models that you are used to working with
       • Great for the common starter use cases
         • Log processing
         • Online data archival
         • ETL/ELT
    4. Hadoop for Advanced Analytics
       When Should I Use Hadoop instead of an RDBMS?
    5. First Symptom: COUNT DISTINCT
    6. Second Symptom: Cursors
    7. Third Symptom: ALTER TABLE OF_DOOM
    8. The Unit of Analysis Problem
       • Data warehouses are optimized to analyze transactions
         • Awesome for finance and ERP
         • Not ideal for product and marketing
       • A function of what databases are good at
    9. What Are You Trying to Analyze?
       • Simple Entities
         • Static attributes
         • Flat data structure
         • Transient
         • Examples
           • SKUs
           • Line items from an invoice
           • Log messages
       • Complex Entities
         • Evolving attributes
         • Hierarchical data structure
         • Persistent
         • Examples
           • Customers
           • Suppliers
           • Website visitors
    10. Rods and Cones vs. Facial Recognition
    11. Structure the Data to Fit the Problem
        • HDFS Lets Us Store Our Data However We Want
        • We can choose storage schemas that are:
          • Flexible
          • Evolvable
          • Compact
          • Fast serialization/deserialization
    12. Advanced Analytics: Use Cases
    13. Simple Counts on Complex Objects
    14. Self-Self-Self-Joins
    15. Matching Problems
    16. We’re Hiring. jwills@cloudera.com
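Slides 9, 11, and 13 can be read together as one idea: store each complex entity as a single hierarchical record, and many questions become simple counts. The sketch below is one possible reading, not the author's implementation. The `Visit` and `Visitor` classes, the sample rows, and the use of JSON lines as the storage format are all assumptions; the deck does not name a serialization format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Visit:
    timestamp: int
    page: str


@dataclass
class Visitor:
    # Hypothetical complex entity: one record per visitor, visits nested inside.
    visitor_id: str
    visits: list = field(default_factory=list)


def build_visitors(rows):
    """Regroup flat (visitor_id, timestamp, page) rows into one record per visitor."""
    visitors = {}
    for visitor_id, ts, page in rows:
        visitors.setdefault(visitor_id, Visitor(visitor_id)).visits.append(Visit(ts, page))
    for visitor in visitors.values():
        visitor.visits.sort(key=lambda v: v.timestamp)
    return visitors


# Made-up sample rows for illustration.
rows = [
    ("v1", 10, "home"),
    ("v2", 12, "home"),
    ("v1", 15, "pricing"),
    ("v2", 50, "pricing"),
    ("v3", 55, "home"),
]
visitors = build_visitors(rows)

# A simple count over complex objects: each visitor appears once, so no DISTINCT.
reached_pricing = sum(
    1 for v in visitors.values() if any(visit.page == "pricing" for visit in v.visits)
)

# One serialized line per entity; JSON is a stand-in for whatever format is chosen.
lines = [json.dumps(asdict(v)) for v in visitors.values()]

print(reached_pricing)
print(lines[0])
```

Because the unit of storage now matches the unit of analysis, adding a new attribute to `Visitor` changes the record layout rather than requiring another join or an altered table.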
