4. About Me…
• Hackerpreneur
• Co-Founder Tellago, Tellago Studios, Inc.
• Microsoft Architect Advisor
• Microsoft MVP
• Oracle ACE
• Speaker, Author
• http://weblogs.asp.net/gsusx
• http://jrodthoughts.com
• http://moesion.com
5. Agenda
• Big Data Overview
• MS HDInsight
– Map Reduce
– HDFS
– Hive
– Pig
– Sqoop
• HDInsight Service
• The Hadoop Ecosystem
• The Future…
6. Big Data?
• A bunch of data?
• An industry?
• An expertise?
• A trend?
• A cliché?
7. A Clue?
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB of user data + 15 TB/day
• 2009: eBay has 6.5 PB of user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day
17. Hadoop Design Principles
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Shall Move to Data
• Simple Core, Modular and Extensible
18. Hadoop History
• 2002-2004: Doug Cutting and Mike Cafarella start working on Nutch
• 2003-2004: Google publishes GFS and MapReduce papers
• 2004: Cutting adds DFS & MapReduce support to Nutch
• 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
• 2007: NY Times converts 4 TB of archives on 100 EC2 instances
• 2008: Web-scale deployments at Y!, Facebook, Last.fm
• April 2008: Yahoo! does fastest sort of a TB, 3.5 min over 910 nodes
• May 2009:
– Yahoo! does fastest sort of a TB, 62 s over 1460 nodes
– Yahoo! sorts a PB in 16.25 hours over 3658 nodes
• June 2009, Oct 2009: Hadoop Summit, Hadoop World
• September 2009: Doug Cutting joins Cloudera
23. HDFS Is…
• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
24. HDFS at a Glance
Block Size = 64 MB
Replication Factor = 3
Cost/GB is a few ¢/month vs $/month
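The block size and replication factor above determine how a file is laid out on the cluster. A minimal sketch of the arithmetic, assuming the two values from this slide; the function name and the 1 GB example file are hypothetical, chosen only for illustration:

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb):
    """Return (block_count, raw_storage_mb) for a single file.

    Assumes the last block only occupies as much disk as the data it
    holds, so raw storage is simply file size times replication.
    """
    block_count = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return block_count, raw_storage_mb


# Hypothetical 1 GB file: 16 blocks, each stored three times.
blocks, raw_mb = hdfs_footprint(1024)
print(blocks, raw_mb)  # 16 3072
```

In other words, every logical GB costs roughly 3 GB of raw disk, which is one reason the per-GB cost of the underlying hardware matters.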
27. Map Reduce Is…
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop
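The programming model can be illustrated with the classic word count. The sketch below runs in a single Python process and only mimics the map, shuffle, and reduce steps; it does not use Hadoop's API, and all function names are hypothetical:

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["big data", "big clusters"]))
# {'big': 2, 'data': 1, 'clusters': 1}
```

In a real cluster the execution framework runs many map and reduce tasks in parallel on different nodes and performs the shuffle over the network; the developer supplies only the map and reduce functions.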
31. Hive Is…
• A system for managing and querying structured data built on top of Hadoop
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files
• Key Building Principles:
– SQL as a familiar data warehousing tool
– Extensibility – Types, Functions, Formats, Scripts
– Scalability and Performance
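"Metadata on raw files" means the table definition lives apart from the data and is applied when a query reads the file. A rough Python sketch of that idea, not Hive itself; the schema, the sample rows, and the function names are all made up for illustration:

```python
import csv
import io

# Hypothetical table metadata: column names and types, kept apart from the data.
SCHEMA = [("page", str), ("views", int)]

# Made-up raw, tab-delimited file contents.
RAW_FILE = "home\t10\nabout\t3\nhome\t5\n"


def read_rows(raw_text, schema):
    """Apply the schema to each raw line at read time."""
    reader = csv.reader(io.StringIO(raw_text), delimiter="\t")
    for fields in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def views_by_page(rows):
    """Roughly what a SUM(views) ... GROUP BY page query would compute."""
    totals = {}
    for row in rows:
        totals[row["page"]] = totals.get(row["page"], 0) + row["views"]
    return totals


print(views_by_page(read_rows(RAW_FILE, SCHEMA)))
# {'home': 15, 'about': 3}
```

The raw file is never rewritten; changing the metadata changes how the same bytes are interpreted.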
35. Pig Is…
Apache Pig is a platform for analyzing large data sets that consists of a high-level language (Pig Latin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
• Ease of programming
• Optimization opportunities
• Extensibility
• Built upon Hadoop
49. Some Challenges
• Hadoop doesn’t power big data applications
– Not a transactional datastore; data sloshes back and forth via ETL
• Processing latency
– Non-incremental, must re-slurp entire dataset every pass
• Ad-Hoc queries
– Bare metal interface, data import
• Graphs
– Only a handful of graph problems are amenable to MapReduce
52. Takeaways
• Hadoop provides the foundation of big data solutions
• Computing and storage are the fundamental components of Hadoop
• HDInsight Server and Service are Microsoft’s distributions of Hadoop
• HDInsight is just one component of Microsoft’s BI strategy