Big Data in the Microsoft Platform

Building Big Data Solutions in
the Microsoft Platform
Jesus Rodriguez
Tellago, Inc, Tellago Studios

About Me…
• Hackerpreneur
• Co-Founder Tellago, Tellago Studios, Inc.
• Microsoft Architect Advisor
• Microsoft MVP
• Oracle ACE
• Speaker, Author
• http://weblogs.asp.net/gsusx
• http://jrodthoughts.com
• http://moesion.com

Agenda
• Big Data Overview
• MS HDInsight
– Map Reduce
– HDFS
– Hive
– Pig
– Sqoop
• HDInsight Service
• The Hadoop Ecosystem
• The Future….

Big Data?
• A bunch of data?
• An industry?
• An expertise?
• A trend?
• A cliché?

A Clue?
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user
data + 15 TB/day
• 2009: eBay has 6.5 PB user data +
50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day

Processing Large Amounts of
Data is Complicated....

Successful Big Data = Scalable
Computing + Large Storage

Parallel Data Computing is
Complicated

Hadoop Design Principles
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Shall Move to Data
• Simple Core, Modular and Extensible

Hadoop History
• 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
• 2003-2004: Google publishes GFS and MapReduce papers
• 2004: Cutting adds DFS & MapReduce support to Nutch
• 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
• 2007: NY Times converts 4 TB of archives using 100 EC2 instances
• 2008: Web-scale deployments at Yahoo!, Facebook, Last.fm
• April 2008: Yahoo! does fastest sort of a TB, 3.5 minutes over 910 nodes
• May 2009:
– Yahoo! does fastest sort of a TB, 62 seconds over 1460 nodes
– Yahoo! sorts a PB in 16.25 hours over 3658 nodes
• June 2009, Oct 2009: Hadoop Summit, Hadoop World
• September 2009: Doug Cutting joins Cloudera

Hadoop Ecosystem
• ETL Tools, BI Reporting, RDBMS
• ZooKeeper (Coordination)
• Pig (Data Flow), Hive (SQL), Sqoop
• Avro (Serialization)
• MapReduce (Job Scheduling/Execution System)
• HBase (key-value store), Streaming/Pipes APIs
• HDFS (Hadoop Distributed File System)

HDFS Is…
• A distributed file system
• Redundant storage
• Designed to reliably store data using
commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System

HDFS at a Glance
Block Size = 64MB
Replication Factor = 3

Cost/GB is a few ¢/month
vs $/month
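
To make the two numbers above concrete, here is a minimal sketch of the arithmetic, assuming the 64 MB block size and replication factor of 3 from the slide. The function name and the 1 GB file size are hypothetical, chosen only for illustration.

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block_replicas, raw_storage_mb) for a single file.

    Hypothetical helper: it only does the arithmetic and does not talk to HDFS.
    """
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored REPLICATION_FACTOR times across the cluster.
    block_replicas = blocks * REPLICATION_FACTOR
    # A partial last block only occupies its actual size, so raw usage is size x replication.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, block_replicas, raw_storage_mb


blocks, block_replicas, raw_storage_mb = hdfs_footprint(1024)  # a 1 GB file
print(blocks, block_replicas, raw_storage_mb)
```

Under these assumptions a 1 GB file becomes 16 blocks, stored as 48 block replicas, and consumes about 3 GB of raw disk.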

Map Reduce Is…
• A programming model for expressing
distributed computations at a massive
scale
• An execution framework for organizing
and performing such computations
• An open-source implementation called
Hadoop
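
The programming model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps using word count; it is not the Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the framework does this in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data moves to compute"])
print(counts)
```

In a real cluster the map and reduce functions run on many nodes in parallel, and the execution framework handles the shuffle, scheduling, and retries.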

Hive Is…
• A system for managing and querying structured data
built on top of Hadoop
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files

• Key Building Principles:
– SQL as a familiar data warehousing tool
– Extensibility – Types, Functions, Formats, Scripts
– Scalability and Performance
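
The "metadata on raw files" idea can be illustrated with a small schema-on-read sketch in plain Python. This is not Hive itself: the table layout, column names, and sample rows are hypothetical, and the aggregation stands in for what a query such as SELECT username, SUM(bytes_sent) ... GROUP BY username would compute.

```python
# Raw tab-separated lines, as they might sit in a file on HDFS (made-up sample data).
RAW_LINES = [
    "alice\t2012-01-01\t120",
    "bob\t2012-01-01\t300",
    "alice\t2012-01-02\t80",
]

# The "metadata": column names and types, kept separately from the raw data.
SCHEMA = [("username", str), ("day", str), ("bytes_sent", int)]


def read_rows(lines, schema):
    """Apply the schema to raw lines at read time, yielding one dict per row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def total_bytes_by_user(lines, schema):
    """Sum bytes_sent per username, the result a GROUP BY query would return."""
    totals = {}
    for row in read_rows(lines, schema):
        totals[row["username"]] = totals.get(row["username"], 0) + row["bytes_sent"]
    return totals


totals = total_bytes_by_user(RAW_LINES, SCHEMA)
print(totals)
```

The raw files are never rewritten; only the schema decides how they are interpreted, which is why the same data can be exposed through different table definitions.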

Pig Is…
Apache Pig is a platform for analyzing large data sets that consists of a
high-level language (PigLatin) for expressing data analysis programs,
coupled with infrastructure for evaluating these programs.

• Ease of programming

• Optimization opportunities

• Extensibility

• Built upon Hadoop
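
A Pig Latin program is a sequence of data-flow steps. The sketch below mirrors a typical LOAD, FILTER, GROUP, FOREACH flow in plain Python; the Pig Latin statements in the comments are illustrative, and the relation names, fields, and sample records are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# logs = LOAD 'logs' AS (user, bytes);   -- here: an in-memory list of made-up records
RECORDS = [("alice", 120), ("bob", 30), ("alice", 80), ("carol", 500)]


def run_pipeline(records, min_bytes):
    """Run the filter -> group -> aggregate flow and return (user, total) pairs."""
    # big = FILTER logs BY bytes > min_bytes;
    big = [record for record in records if record[1] > min_bytes]
    # grp = GROUP big BY user;   (groupby needs its input sorted by the grouping key)
    big.sort(key=itemgetter(0))
    grouped = groupby(big, key=itemgetter(0))
    # result = FOREACH grp GENERATE group, SUM(big.bytes);
    return [(user, sum(nbytes for _, nbytes in rows)) for user, rows in grouped]


result = run_pipeline(RECORDS, min_bytes=50)
print(result)
```

In Pig each of these steps is a declarative statement, and the compiler decides how to turn the whole flow into MapReduce jobs, which is where the optimization opportunities come from.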

Pig Architecture
Grunt (Interactive shell) PigServer (Java API)

Parser (PigLatin → LogicalPlan)

Optimizer (LogicalPlan → LogicalPlan)
Pig Context
Compiler (LogicalPlan → PhysicalPlan → MapReducePlan)

ExecutionEngine

Hadoop

HDInsight
Rocking Data Processing
with Pig

Sqoop Is…
• Easy import of data from many
databases to HDFS
• Generates code for use in MapReduce
applications
• Integrates with Hive

HDInsight
Bulk Data Loading Using
Sqoop

HDInsight Service Architecture

HDInsight
HDInsight Service
Overview

Hadoop is not a silver bullet...

Some Challenges
• Hadoop doesn’t power big data applications
– Not a transactional datastore. Slosh back and forth via
ETL
• Processing latency
– Non-incremental, must re-slurp entire dataset every
pass
• Ad-Hoc queries
– Bare metal interface, data import
• Graphs
– Only a handful of graph problems amenable to MR

Beyond Hadoop
• Percolator (incremental processing)
http://research.google.com/pubs/pub36726.html
• Dremel (ad-hoc analysis queries)
http://research.google.com/pubs/pub36632.html
• Pregel (Big graphs)
http://dl.acm.org/citation.cfm?id=1807184

Takeaways
• Hadoop provides the foundation of big
data solutions
• Computing and storage are the
fundamental components of Hadoop
• HDInsight Server and Service are
Microsoft’s distributions of Hadoop
• HDInsight is just one component of
Microsoft’s BI strategy

Thanks
jesus.rodriguez@tellago.com
http://www.tellagostudios.com
http://jrodthoughts.com
http://twitter.com/#!/jrodthoughts
http://weblogs.asp.net/gsusx
