Share and collaborate via Windows Azure Marketplace:The Microsoft Big Data solution enables customers to share data and insights through Windows Azure Marketplace, which exposes hundreds of applications and data mining algorithms from Microsoft and third parties to help unlock unprecedented insights for customers. Microsoft’s Hadoop based service for Windows Azure offers seamless connection to Azure Marketplace through the Open Data (ODATA) Protocol.
Integrate with social media:Microsoft’s Big Data solution enables customers to augment their analysis with publicly available data from social media sites (such as Twitter and Facebook) and hundreds of trusted data providers on Windows Azure Marketplace. Microsoft Codename "Social Analytics" allows for integration of social information with business applications.
Galaxy of bitsSurviving the flood of informationMichał Żyliński, Microsoft(firstname.lastname@example.org)
In 2000 the Sloan Digital Sky Survey collected more data in its 1st week than was collected in the entire history of Astronomy By 2016 the New Large Synoptic Survey Telescope in Chile will acquire 140 terabytes in 5 days - more than Sloan acquired in 10 years The Large Hadron Collider at CERN generates 40 terabytes of data every second 2Sources: The Economist, Feb ‘10; IDC
Bing ingests > 7 petabyte a month The Twitter community generates over 1 terabyte of tweets every day Cisco predicts that by 2013 annual internet traffic flowing will reach 667 exabytes 3Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
1,800,000,00 1,8 0,000,000,00 0,000 bytes The size of Digital Universe in ZB 2011 9 8 7 6 5 Within 24 months # of intelligent devices > traditional IT devices 4 3 2 In 2015 nearly 20% 1 0 of the information will 2010 2011 2012 2015 be touched by cloudSources: IDC Digital Universe Study 2011, Worldwide Big Data Technology and Services 2012–2015 Forecast
Hadoop in detailAnalysis of semi and unstructured data distributed across a commodity clusterBased on Google’s MapReduce paperand Google File system (GFS)Programs = Sequence of “map” and“reduce” tasks.Simplify writing distributed applicationsHighly fault tolerant – multiple copiesMove computation close to dataImplemented in Java and optimized forLinux
Traditional RDBMS MapReduceData Size Gigabytes (Terabytes) Petabytes (Hexabytes)Access Interactive and Batch BatchUpdates Read / Write many times Write once, Read many timesStructure Static Schema Dynamic SchemaIntegrity High (ACID) LowScaling Nonlinear LinearDBA Ratio 1:40 1:3000
Hadoop + MicrosoftOur own • Submit changes back todistribution of Apache FoundationHadoop • Download for free • AD & Systems CenterOptimized for integrationWindows & Azure • Hadoop-as-a-service-on- AzureFocus on .NET • Integration with Visual StudioDevelopers • Support for C# • Performance and Scale • High Availability • Ease of use
Why Hadoop as a Service?• Task based billing• Easy admin• Zero install• Support a wide variety of job types – Machine Learning (mahout), Graph Mining (Pegasus), HIVE, Pig, Java, JS, etc.• Greatly simplified UI cheap fast
Reality check A.D. 2012 ANALYTICS SELF-SERVICE MOBILE OPERATIONAL REAL-TIME PREDICTIVE COLLABORATIVE MARKETPLACE DATA ENRICHMENT External Data and Services DISCOVER TRANSFORM SHARE AND RECOMMEND AND CLEAN AND GOVERN DATA MANAGEMENT 1 011 01RELATIONAL NON RELATIONAL MULTIDIMENSIONAL STREAMING
Use Case: • Extremely large volume ofMicrosoft unstructured web logBI Tools analysis • Ad hoc analysis of unstructured web logs to prototype patterns • Hadoop data feeds large 24TB Cube24 TB CubeHadoop Distribution