Galaxy of bits
Upload Details

Uploaded as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

  • Share and collaborate via Windows Azure Marketplace: The Microsoft Big Data solution enables customers to share data and insights through Windows Azure Marketplace, which exposes hundreds of applications and data mining algorithms from Microsoft and third parties to help unlock unprecedented insights for customers. Microsoft’s Hadoop-based service for Windows Azure offers seamless connection to Azure Marketplace through the Open Data Protocol (OData).
  • Integrate with social media: Microsoft’s Big Data solution enables customers to augment their analysis with publicly available data from social media sites (such as Twitter and Facebook) and hundreds of trusted data providers on Windows Azure Marketplace. Microsoft Codename "Social Analytics" allows for integration of social information with business applications.

Galaxy of bits: Presentation Transcript

  • Galaxy of bits: Surviving the flood of information. Michał Żyliński, Microsoft (michal.zylinski@microsoft.com)
  • In 2000, the Sloan Digital Sky Survey collected more data in its first week than had been collected in the entire history of astronomy. By 2016, the new Large Synoptic Survey Telescope in Chile will acquire 140 terabytes in 5 days, more than Sloan acquired in 10 years. The Large Hadron Collider at CERN generates 40 terabytes of data every second. Sources: The Economist, Feb ‘10; IDC
  • Bing ingests more than 7 petabytes a month. The Twitter community generates over 1 terabyte of tweets every day. Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes. Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
  • 1,800,000,000,000,000,000,000 bytes (1.8 ZB): the size of the Digital Universe in 2011. [Chart: the size of the Digital Universe in ZB, 2010 to 2015.] Within 24 months, the number of intelligent devices will exceed the number of traditional IT devices. In 2015, nearly 20% of the information will be touched by the cloud. Sources: IDC Digital Universe Study 2011; Worldwide Big Data Technology and Services 2012–2015 Forecast
  • But... how real is it?
  • Where big data is used, by industry:
    Financial Services: Modeling True Risk, Threat Analysis, Fraud Detection, Trade Surveillance, Credit Scoring and Analysis
    Retail: Point of Sales Transaction Analysis, Customer Churn Analysis, Sentiment Analysis
    E-Commerce: Recommendation Engines, Ad Targeting, Search Quality, Abuse and Click Fraud Detection
    Telecommunications: Customer Churn Prevention, Network Performance Optimization, Call Detail Record (CDR) Analysis, Analyzing Network to Predict Failure
  • A day in life of typical e-commerce site
  • New exploratory e-commerce data flow
  • So how does it work? FIRST, STORE THE DATA
  • So how does it work? SECOND, TAKE THE PROCESSING TO THE DATA
        // Map Reduce function in JavaScript
        // map: split one line of text into words and emit (word, 1) for each
        var map = function (key, value, context) {
            var words = value.split(/[^a-zA-Z]/);
            for (var i = 0; i < words.length; i++) {
                if (words[i] !== "") {
                    context.write(words[i].toLowerCase(), 1);
                }
            }
        };
        // reduce: add up all the counts emitted for one word
        var reduce = function (key, values, context) {
            var sum = 0;
            while (values.hasNext()) {
                sum += parseInt(values.next(), 10);
            }
            context.write(key, sum);
        };
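To see what the framework does around those two functions, the word count can be run locally with a tiny driver. This is only a sketch under stated assumptions: `runLocally` and the two context objects are hypothetical stand-ins written for this example, not Hadoop APIs. They mimic the `context.write`, `values.hasNext()` and `values.next()` calls the slide's code relies on, and the "shuffle" step is simply grouping emitted values by key in memory.

```javascript
// The slide's map and reduce functions, same behavior.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local driver (an assumption for illustration, not a Hadoop API).
function runLocally(lines) {
  // Map phase: collect every emitted value under its key (the "shuffle").
  var groups = new Map();
  var mapContext = {
    write: function (k, v) {
      if (!groups.has(k)) { groups.set(k, []); }
      groups.get(k).push(v);
    }
  };
  lines.forEach(function (line, offset) { map(offset, line, mapContext); });

  // Reduce phase: call reduce once per key with an iterator over its values.
  var result = new Map();
  var reduceContext = { write: function (k, v) { result.set(k, v); } };
  Array.from(groups.keys()).sort().forEach(function (k) {
    var vals = groups.get(k);
    var pos = 0;
    var iterator = {
      hasNext: function () { return pos < vals.length; },
      next: function () { return vals[pos++]; }
    };
    reduce(k, iterator, reduceContext);
  });
  return result;
}

var counts = runLocally(["The quick brown fox", "the lazy dog and the fox"]);
console.log(counts);
```

On a real cluster the map calls run in parallel on the nodes that hold the input, and the grouping happens over the network. The driver above only shows the data flow.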
  • Hadoop in detail
    Analysis of semi-structured and unstructured data distributed across a commodity cluster
    Based on Google’s MapReduce paper and Google File System (GFS)
    Programs = sequence of “map” and “reduce” tasks
    Simplifies writing distributed applications
    Highly fault tolerant (multiple copies)
    Moves computation close to data
    Implemented in Java and optimized for Linux
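The "multiple copies" point can be illustrated with a toy placement function. This is a sketch, not how HDFS actually chooses nodes: the round-robin rule, the function names `placeBlocks` and `readableAfterFailure`, the node names, and the 64 MB block size with 3 replicas are all assumptions made for the example. It only shows why storing each block on several distinct nodes lets the file survive the loss of one machine.

```javascript
// Split a file into fixed-size blocks and assign each block to `replication`
// distinct nodes. Round-robin placement is an assumption for illustration.
function placeBlocks(fileSizeBytes, blockSizeBytes, replication, nodes) {
  if (replication > nodes.length) {
    throw new Error("need at least as many nodes as replicas");
  }
  var blockCount = Math.ceil(fileSizeBytes / blockSizeBytes);
  var placement = [];
  for (var b = 0; b < blockCount; b++) {
    var replicas = [];
    for (var r = 0; r < replication; r++) {
      // Consecutive offsets guarantee the replicas land on distinct nodes.
      replicas.push(nodes[(b + r) % nodes.length]);
    }
    placement.push({ block: b, replicas: replicas });
  }
  return placement;
}

// True if every block still has at least one replica off the failed node.
function readableAfterFailure(placement, failedNode) {
  return placement.every(function (entry) {
    return entry.replicas.some(function (node) { return node !== failedNode; });
  });
}

var MB = 1024 * 1024;
var nodes = ["node-a", "node-b", "node-c", "node-d"]; // hypothetical cluster
var placement = placeBlocks(200 * MB, 64 * MB, 3, nodes);
console.log(placement);
console.log(readableAfterFailure(placement, "node-b"));
```

Because each block lives on several nodes, a scheduler can also pick a node that already holds the block when it assigns a map task, which is the "move computation close to data" idea.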
  • HDFS
  • VS
  • Traditional RDBMS vs. MapReduce
                    Traditional RDBMS          MapReduce
    Data Size       Gigabytes (Terabytes)      Petabytes (Exabytes)
    Access          Interactive and Batch      Batch
    Updates         Read / Write many times    Write once, Read many times
    Structure       Static Schema              Dynamic Schema
    Integrity       High (ACID)                Low
    Scaling         Nonlinear                  Linear
    DBA Ratio       1:40                       1:3000
  • Hadoop Ecosystem
    Traditional BI Tools
    HBase / Cassandra (Columnar NoSQL Databases)
    Oozie (Workflow)
    Hive (Warehouse and Data Access)
    Pig (Data Flow)
    Karmasphere (Development Tool)
    Apache Mahout
    Flume
    Sqoop
    Zookeeper (Coordination)
    Avro (Serialization)
    HBase (Column DB)
    MapReduce (Job Scheduling/Execution System)
    HDFS (Hadoop Distributed File System)
    Hadoop = MapReduce + HDFS
  • Hadoop + Microsoft
    Our own distribution of Hadoop: submit changes back to Apache Foundation; download for free
    Optimized for Windows & Azure: AD & System Center integration; Hadoop-as-a-service on Azure
    Focus on .NET developers: integration with Visual Studio; support for C#
    Performance and scale; high availability; ease of use
  • Why Hadoop as a Service?
    Task-based billing
    Easy admin
    Zero install
    Support for a wide variety of job types: Machine Learning (Mahout), Graph Mining (Pegasus), Hive, Pig, Java, JS, etc.
    Greatly simplified UI
    Cheap, fast
  • HADOOP ON AZURE
  • UNIX Pipes
        cat [input_file] | [mapper] | sort | [reducer] > [output_file]
    Hadoop Streaming
        hadoop jar libhadoop-streaming.jar \
            -input directory \
            -output directory \
            -mapper any script or executable \
            -reducer any script or executable
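The pipe analogy can be sketched in a few lines. Here the mapper and reducer are plain functions over arrays of text lines rather than real stdin/stdout programs, so the whole `cat | mapper | sort | reducer` chain runs in one file. The function names and the sample input are made up for this example. The tab between key and value follows Hadoop Streaming's default convention, and the reducer depends on its input being sorted so that equal keys are adjacent.

```javascript
// Mapper stage: one "word<TAB>1" line per word in the input.
function mapper(lines) {
  var out = [];
  lines.forEach(function (line) {
    line.split(/[^a-zA-Z]/).forEach(function (word) {
      if (word !== "") { out.push(word.toLowerCase() + "\t1"); }
    });
  });
  return out;
}

// Reducer stage: input must be sorted, so each key arrives as one run.
function reducer(sortedLines) {
  var out = [];
  var currentKey = null;
  var sum = 0;
  sortedLines.forEach(function (line) {
    var parts = line.split("\t");
    if (parts[0] !== currentKey) {
      // Key changed: flush the finished run before starting the next one.
      if (currentKey !== null) { out.push(currentKey + "\t" + sum); }
      currentKey = parts[0];
      sum = 0;
    }
    sum += parseInt(parts[1], 10);
  });
  if (currentKey !== null) { out.push(currentKey + "\t" + sum); }
  return out;
}

// Equivalent of: cat input | mapper | sort | reducer
var input = ["to be or not to be"];
var output = reducer(mapper(input).sort());
console.log(output.join("\n"));
```

To use the same logic with Hadoop Streaming, each function would be wrapped in a script that reads lines from standard input and writes lines to standard output. The framework then takes over the sort between the two stages.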
  • wordcount.js
  • FIRST STEPS IN MAP/REDUCE
  • PIG
  • HIVE & EXCEL INTEGRATION
  • Big Data Candies
  • Data Market integration: Benefits; Key Features
  • Some other fancy stuff... Benefits: models augmented with publicly available data from social media sites. Key Features: Microsoft Codename "Social Analytics"
  • Wrapping up...
  • Reality check A.D. 2012
    ANALYTICS: Self-Service, Mobile, Operational, Real-Time, Predictive, Collaborative
    DATA ENRICHMENT: Discover and Recommend, Transform and Clean, Share and Govern
    MARKETPLACE: External Data and Services
    DATA MANAGEMENT: Relational, Non-Relational, Multidimensional, Streaming
  • Use Case: Extremely large volume of unstructured web log analysis. Ad hoc analysis of unstructured web logs to prototype patterns. Hadoop data feeds a large 24 TB cube. [Diagram labels: Microsoft BI Tools, 24 TB Cube, Hadoop Distribution]
  • Michal.Zylinski@microsoft.com Thank you!