Galaxy of bits
Surviving the flood of information
Michał Żyliński, Microsoft (email@example.com)
In 2000 the Sloan Digital Sky Survey collected more data in its 1st week than was collected in the entire history of astronomy.
By 2016 the new Large Synoptic Survey Telescope in Chile will acquire 140 terabytes in 5 days, more than Sloan acquired in 10 years.
The Large Hadron Collider at CERN generates 40 terabytes of data every second.
Sources: The Economist, Feb ‘10; IDC
Bing ingests more than 7 petabytes a month.
The Twitter community generates over 1 terabyte of tweets every day.
Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes.
Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
The size of the Digital Universe in 2011: 1.8 ZB (1,800,000,000,000,000,000,000 bytes)
[Chart: the size of the Digital Universe in ZB, 2010 to 2015]
Within 24 months, the number of intelligent devices will exceed the number of traditional IT devices.
In 2015 nearly 20% of the information will be touched by cloud.
Sources: IDC Digital Universe Study 2011, Worldwide Big Data Technology and Services 2012–2015 Forecast
But... how real is it?
Financial Services
• Modeling True Risk
• Threat Analysis
• Fraud Detection
• Trade Surveillance
• Credit scoring and analysis
Retail
• Point of Sales Transaction Analysis
• Customer Churn Analysis
• Sentiment Analysis
Telecommunications
• Customer Churn Prevention
• Network Performance optimization
• Call Detail Record (CDR) Analysis
• Analyzing Network to Predict Failure
E-Commerce
• Recommendation Engines
• Ad Targeting
• Search Quality
• Abuse and click fraud detection
A day in life of typical e-commerce site
New exploratory e-commerce data flow
So how does it work? FIRST, STORE THE DATA
Hadoop in detail
• Analysis of semi- and unstructured data distributed across a commodity cluster
• Based on Google’s MapReduce paper and Google File System (GFS)
• Programs = sequence of “map” and “reduce” tasks
• Simplifies writing distributed applications
• Highly fault tolerant: multiple copies
• Moves computation close to data
• Implemented in Java and optimized for Linux
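To make "programs = sequence of map and reduce tasks" concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and nothing in it is distributed: `map_reduce`, `hits_mapper`, `sum_reducer`, and the three-field log format are all hypothetical names and assumptions chosen for illustration. It only shows the shape of the model: the map step emits key/value pairs, values are grouped by key, and the reduce step folds each group.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a map phase, group values by key, then run a reduce phase."""
    grouped = defaultdict(list)
    # Map phase: every input record may emit zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    # Reduce phase: each key is reduced independently of the others,
    # which is what lets a real cluster spread the work across machines.
    return {key: reducer(key, values) for key, values in sorted(grouped.items())}


def hits_mapper(log_line):
    """Emit (url, 1) for each well-formed line.

    Assumed (hypothetical) log format: "<ip> <url> <status>".
    """
    parts = log_line.split()
    if len(parts) == 3:
        yield parts[1], 1


def sum_reducer(key, values):
    """Sum all counts collected for one key."""
    return sum(values)


SAMPLE_LOG = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /cart 200",
    "10.0.0.1 /home 200",
    "malformed line",
    "10.0.0.3 /cart 500",
    "10.0.0.4 /home 200",
]

if __name__ == "__main__":
    print(map_reduce(SAMPLE_LOG, hits_mapper, sum_reducer))
```

The malformed line is skipped by the mapper rather than rejected up front, which mirrors the schema-on-read style the slides associate with semi-structured and unstructured data.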
                Traditional RDBMS          MapReduce
Data Size       Gigabytes (Terabytes)      Petabytes (Exabytes)
Access          Interactive and Batch      Batch
Updates         Read / Write many times    Write once, Read many times
Structure       Static Schema              Dynamic Schema
Integrity       High (ACID)                Low
Scaling         Nonlinear                  Linear
DBA Ratio       1:40                       1:3000
Hadoop + Microsoft
Our own distribution of Hadoop
• Submit changes back to Apache Foundation
• Download for free
Optimized for Windows & Azure
• AD & Systems Center integration
• Hadoop-as-a-service-on-Azure
Focus on .NET Developers
• Integration with Visual Studio
• Support for C#
• Performance and Scale
• High Availability
• Ease of use
Why Hadoop as a Service?
• Task based billing
• Easy admin
• Zero install
• Support a wide variety of job types
  – Machine Learning (Mahout), Graph Mining (Pegasus), HIVE, Pig, Java, JS, etc.
• Greatly simplified UI
Cheap. Fast.
HADOOP ON AZURE
UNIX Pipes
    cat [input_file] | [mapper] | sort | [reducer] > [output_file]

Hadoop Streaming
    hadoop jar libhadoop-streaming.jar
        -input directory
        -output directory
        -mapper any script or executable
        -reducer any script or executable
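The pipe analogy can be tried locally. The sketch below is a word count written in the streaming style, assuming the usual convention of one tab-separated key/value pair per line; the names `mapper`, `reducer`, and `run_local` are hypothetical, and `run_local` only imitates `mapper | sort | reducer` in one process rather than invoking Hadoop.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" line per word, like a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum counts per word.

    Relies on the input being sorted so that equal keys are adjacent,
    which is the job of `sort` in the pipe (and of the shuffle in Hadoop).
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Local stand-in for: cat input | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run_local(["the quick fox", "The lazy dog the end"]):
        print(out)
```

To use the same logic as real `-mapper` and `-reducer` executables, each function would be placed in its own script that reads `sys.stdin` and prints its output lines. The reducer never holds more than one key's group in memory, which is why the sort step between the two stages matters.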
FIRST STEPS IN MAP/REDUCE
HIVE & EXCEL INTEGRATION
Data Market integration
Benefits
Key Features
Some other fancy stuff...
Benefits
• Models augmented with publicly available data from social media sites
Key Features
• Microsoft Codename "Social Analytics"
Reality check A.D. 2012
ANALYTICS: SELF-SERVICE, MOBILE, OPERATIONAL, REAL-TIME, PREDICTIVE, COLLABORATIVE
DATA ENRICHMENT: MARKETPLACE (External Data and Services), DISCOVER AND RECOMMEND, TRANSFORM AND CLEAN, SHARE AND GOVERN
DATA MANAGEMENT: RELATIONAL, NON RELATIONAL, MULTIDIMENSIONAL, STREAMING
Use Case:
• Extremely large volume of unstructured web log analysis
• Ad hoc analysis of unstructured web logs to prototype patterns
• Hadoop data feeds large 24 TB Cube
[Diagram: Microsoft BI Tools, 24 TB Cube, Hadoop Distribution]