Galaxy of bits
Surviving the flood of information




Michał Żyliński, Microsoft
(michal.zylinski@microsoft.com)
In 2000 the Sloan Digital Sky Survey collected more data in its first week than was collected in the entire history of astronomy




By 2016 the new Large Synoptic Survey Telescope in Chile will acquire 140 terabytes in 5 days - more than Sloan acquired in 10 years



The Large Hadron Collider at CERN generates 40 terabytes of data every second




Sources: The Economist, Feb ‘10; IDC
Bing ingests > 7 petabytes a month



            The Twitter community generates over 1 terabyte of tweets every day


Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes




Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
The size of the Digital Universe in 2011: 1.8 ZB (1,800,000,000,000,000,000,000 bytes)
[Chart: size of the Digital Universe in ZB, axis 0 to 9, for 2010, 2011, 2012 and 2015]
Within 24 months: # of intelligent devices > traditional IT devices
In 2015 nearly 20% of the information will be touched by cloud
Sources: IDC Digital Universe Study 2011, Worldwide Big Data Technology and Services 2012–2015 Forecast
But... how real is it?
Financial Services
• Modeling True Risk
• Threat Analysis
• Fraud Detection
• Trade Surveillance
• Credit scoring and analysis

Retail
• Point of Sales Transaction Analysis
• Customer Churn Analysis
• Sentiment Analysis

E-Commerce
• Recommendation Engines
• Ad Targeting
• Search Quality
• Abuse and click fraud detection

Telecommunications
• Customer Churn Prevention
• Network Performance optimization
• Call Detail Record (CDR) Analysis
• Analyzing Network to Predict Failure
A day in the life of a typical e-commerce site
New exploratory e-commerce data flow
So how does it work?
   FIRST, STORE THE DATA
So how does it work?
SECOND, TAKE THE PROCESSING TO THE DATA



    // MapReduce functions in JavaScript (word count)

    // map: split each input line into words and emit (word, 1)
    var map = function (key, value, context) {
        var words = value.split(/[^a-zA-Z]/);
        for (var i = 0; i < words.length; i++) {
            if (words[i] !== "") {
                context.write(words[i].toLowerCase(), 1);
            }
        }
    };

    // reduce: sum the counts emitted for each word
    var reduce = function (key, values, context) {
        var sum = 0;
        while (values.hasNext()) {
            sum += parseInt(values.next());
        }
        context.write(key, sum);
    };
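The map and reduce functions on this slide rely on a context object and a values iterator that the Hadoop runtime supplies. As a minimal sketch for trying them without a cluster, the block below repeats the two functions and adds a hypothetical runLocally driver (an assumption for illustration, not part of Hadoop or of the original deck). It feeds lines to map, groups the emitted pairs by key, and hands each group to reduce through a small hasNext/next iterator.

```javascript
// Word-count map and reduce, as on the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local driver (assumption): imitates the map, group-by-key,
// and reduce phases in memory on a single machine.
function runLocally(lines) {
    var grouped = new Map();
    var mapContext = {
        write: function (k, v) {
            if (!grouped.has(k)) { grouped.set(k, []); }
            grouped.get(k).push(v);
        }
    };
    lines.forEach(function (line, offset) {
        map(offset, line, mapContext);
    });

    var result = {};
    var reduceContext = {
        write: function (k, v) { result[k] = v; }
    };
    Array.from(grouped.keys()).sort().forEach(function (k) {
        var vals = grouped.get(k);
        var pos = 0;
        // Minimal iterator with the hasNext/next shape that reduce expects.
        var iterator = {
            hasNext: function () { return pos < vals.length; },
            next: function () { return vals[pos++]; }
        };
        reduce(k, iterator, reduceContext);
    });
    return result;
}

var counts = runLocally([
    "Galaxy of bits",
    "bits of data, bits of information"
]);
console.log(counts);
```

On a real cluster the grouping step is the shuffle performed by the framework between the map and reduce phases; here it is a single in-memory Map.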
Hadoop in detail
Analysis of semi-structured and unstructured data distributed across a commodity cluster

Based on Google’s MapReduce paper and the Google File System (GFS)
Programs = sequence of “map” and “reduce” tasks
Simplifies writing distributed applications
Highly fault tolerant – multiple copies of the data
Moves computation close to the data
Implemented in Java and optimized for Linux
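The "multiple copies" point can be illustrated with a toy model. The sketch below is an assumption-laden simplification: the round-robin placeBlocks function and the node names are hypothetical, and this is not the placement policy HDFS actually uses. It only shows why a block stays readable as long as at least one of its replicas sits on a live node.

```javascript
// Toy placement (assumption): put each block on `replication` distinct
// nodes, chosen round-robin. Real HDFS placement is more involved.
function placeBlocks(blockCount, nodes, replication) {
    if (replication > nodes.length) {
        throw new Error("replication factor exceeds number of nodes");
    }
    var placement = [];
    for (var b = 0; b < blockCount; b++) {
        var replicas = [];
        for (var r = 0; r < replication; r++) {
            replicas.push(nodes[(b + r) % nodes.length]);
        }
        placement.push(replicas);
    }
    return placement;
}

// Count blocks that still have at least one replica on a live node.
function readableBlocks(placement, failedNodes) {
    return placement.filter(function (replicas) {
        return replicas.some(function (node) {
            return failedNodes.indexOf(node) === -1;
        });
    }).length;
}

var nodes = ["node1", "node2", "node3", "node4"];
var placement = placeBlocks(6, nodes, 3);

// With 3 copies per block, losing any 2 nodes loses no data.
var afterTwoFailures = readableBlocks(placement, ["node1", "node2"]);
// Losing 3 nodes can take out every replica of some blocks.
var afterThreeFailures = readableBlocks(placement, ["node1", "node2", "node3"]);

console.log(afterTwoFailures, afterThreeFailures);
```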
HDFS
Traditional RDBMS vs MapReduce

            Traditional RDBMS         MapReduce
Data Size   Gigabytes (Terabytes)     Petabytes (Exabytes)
Access      Interactive and Batch     Batch
Updates     Read / Write many times   Write once, Read many times
Structure   Static Schema             Dynamic Schema
Integrity   High (ACID)               Low
Scaling     Nonlinear                 Linear
DBA Ratio   1:40                      1:3000
Hadoop Ecosystem
[Diagram: components of the Hadoop stack]
• Traditional BI Tools
• HBase / Cassandra (Columnar NoSQL Databases)
• Oozie (Workflow)
• Pig (Data Flow)
• Hive (Warehouse and Data Access)
• Karmasphere (Development Tool)
• Apache Mahout
• Flume
• Sqoop
• Zookeeper (Coordination)
• Avro (Serialization)
• HBase (Column DB)
• MapReduce (Job Scheduling/Execution System)
• HDFS (Hadoop Distributed File System)

Hadoop = MapReduce + HDFS
Hadoop + Microsoft

Our own distribution of Hadoop
• Submit changes back to Apache Foundation
• Download for free

Optimized for Windows & Azure
• AD & System Center integration
• Hadoop-as-a-service-on-Azure

Focus on .NET Developers
• Integration with Visual Studio
• Support for C#

• Performance and Scale
• High Availability
• Ease of use
Why Hadoop as a Service?
•   Task based billing
•   Easy admin
•   Zero install
•   Support a wide variety of job types
    – Machine Learning (Mahout), Graph Mining (Pegasus), Hive, Pig, Java, JS, etc.
• Greatly simplified UI

      cheap                        fast
HADOOP ON AZURE
UNIX Pipes
cat [input_file] | [mapper] | sort | [reducer] > [output_file]

          Hadoop Streaming
hadoop jar libhadoop-streaming.jar
    -input directory
    -output directory
    -mapper any script or executable
    -reducer any script or executable
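The pipe analogy above can be made concrete. The following is a minimal sketch, assuming the usual streaming convention of one tab-separated key and value per line. The names mapper, reducer and pipeline are hypothetical, and the pipeline function is only an in-memory stand-in for the cat | mapper | sort | reducer chain. In a real streaming job the mapper and reducer would be separate scripts reading standard input and writing standard output.

```javascript
// Mapper: one input line in, zero or more "word<TAB>1" lines out.
function mapper(line) {
    return line.split(/[^a-zA-Z]/)
        .filter(function (word) { return word !== ""; })
        .map(function (word) { return word.toLowerCase() + "\t1"; });
}

// Reducer: expects its input sorted by key, so equal keys are adjacent.
// It keeps a running sum and flushes it whenever the key changes.
function reducer(sortedLines) {
    var out = [];
    var currentKey = null;
    var sum = 0;
    sortedLines.forEach(function (line) {
        var parts = line.split("\t");
        if (parts[0] !== currentKey) {
            if (currentKey !== null) {
                out.push(currentKey + "\t" + sum);
            }
            currentKey = parts[0];
            sum = 0;
        }
        sum += parseInt(parts[1], 10);
    });
    if (currentKey !== null) {
        out.push(currentKey + "\t" + sum);
    }
    return out;
}

// In-memory stand-in for: cat input | mapper | sort | reducer
function pipeline(inputLines) {
    var mapped = [];
    inputLines.forEach(function (line) {
        mapped = mapped.concat(mapper(line));
    });
    return reducer(mapped.sort());
}

var output = pipeline(["to be or not to be"]);
console.log(output.join("\n"));
```

The sort step is what lets the reducer work with constant memory per key: it never needs to hold more than the current key and its running sum.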
wordcount.js
FIRST STEPS IN MAP/REDUCE
PIG
HIVE & EXCEL INTEGRATION
Big Data Candies
Benefits
Key Features
               Data Market integration
Benefits
                  Some other fancy stuff...


               Models augmented with publicly available data from social media sites
Key Features




                                         Microsoft Codename "Social Analytics"
Wrapping up...
Reality check A.D. 2012
[Diagram: data platform layers]
ANALYTICS: SELF-SERVICE, MOBILE, OPERATIONAL, REAL-TIME, PREDICTIVE, COLLABORATIVE
DATA ENRICHMENT: DISCOVER AND RECOMMEND, TRANSFORM AND CLEAN, SHARE AND GOVERN; MARKETPLACE (External Data and Services)
DATA MANAGEMENT: RELATIONAL, NON RELATIONAL, MULTIDIMENSIONAL, STREAMING
Use Case:

• Analysis of an extremely large volume of unstructured web logs
• Ad hoc analysis of unstructured web logs to prototype patterns
• Hadoop data feeds a large 24 TB Cube

[Diagram labels: Microsoft BI Tools, 24 TB Cube, Hadoop Distribution]
Michal.Zylinski@microsoft.com


     Thank you!

More Related Content

What's hot

Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better TogetherYahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better TogetherDenny Lee
 
2012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum22012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum2Wilfried Hoge
 
A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...Paco Nathan
 
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip finalDeborah McGuinness
 
Data Culture Series - Keynote & Panel - Reading - 12th May 2015
Data Culture Series  - Keynote & Panel - Reading - 12th May 2015Data Culture Series  - Keynote & Panel - Reading - 12th May 2015
Data Culture Series - Keynote & Panel - Reading - 12th May 2015Jonathan Woodward
 
sones company presentation
sones company presentationsones company presentation
sones company presentationsones GmbH
 
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)Emil Eifrem
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...DESTIN-Informatique.com
 
Introduction to HADOOP
Introduction to HADOOPIntroduction to HADOOP
Introduction to HADOOPShital Kat
 
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)MIT College Of Engineering,Pune
 
SeCold - A Linked Data Platform for Mining Software Repositories
SeCold - A Linked Data Platform for  Mining Software RepositoriesSeCold - A Linked Data Platform for  Mining Software Repositories
SeCold - A Linked Data Platform for Mining Software Repositoriesimanmahsa
 
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...Application of Data Warehousing & Data Mining to Exploitation for Supporting ...
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...Gihan Wikramanayake
 
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph AlgorithmsMassive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph AlgorithmsDavid Gleich
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeArangoDB Database
 

What's hot (20)

Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better TogetherYahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
 
2012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum22012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum2
 
A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...
 
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
 
Data Culture Series - Keynote & Panel - Reading - 12th May 2015
Data Culture Series  - Keynote & Panel - Reading - 12th May 2015Data Culture Series  - Keynote & Panel - Reading - 12th May 2015
Data Culture Series - Keynote & Panel - Reading - 12th May 2015
 
My Master's Thesis
My Master's ThesisMy Master's Thesis
My Master's Thesis
 
sones company presentation
sones company presentationsones company presentation
sones company presentation
 
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...
 
Seminar Report Vaibhav
Seminar Report VaibhavSeminar Report Vaibhav
Seminar Report Vaibhav
 
Introduction to HADOOP
Introduction to HADOOPIntroduction to HADOOP
Introduction to HADOOP
 
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
 
SeCold - A Linked Data Platform for Mining Software Repositories
SeCold - A Linked Data Platform for  Mining Software RepositoriesSeCold - A Linked Data Platform for  Mining Software Repositories
SeCold - A Linked Data Platform for Mining Software Repositories
 
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...Application of Data Warehousing & Data Mining to Exploitation for Supporting ...
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...
 
Dbm630 Lecture01
Dbm630 Lecture01Dbm630 Lecture01
Dbm630 Lecture01
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph AlgorithmsMassive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
 

Viewers also liked

Viewers also liked (9)

Silverlight i PHP
Silverlight i PHPSilverlight i PHP
Silverlight i PHP
 
StorSimple a może do chmury
StorSimple a może do chmuryStorSimple a może do chmury
StorSimple a może do chmury
 
Zmierzch epoki łowcy
Zmierzch epoki łowcyZmierzch epoki łowcy
Zmierzch epoki łowcy
 
Php i Microsoft
Php i MicrosoftPhp i Microsoft
Php i Microsoft
 
Arvore de causas
Arvore de causasArvore de causas
Arvore de causas
 
Cipa Arvore De Causas
Cipa   Arvore De CausasCipa   Arvore De Causas
Cipa Arvore De Causas
 
Procedimento de análise de acidentes e incidentes
Procedimento de análise de acidentes e incidentesProcedimento de análise de acidentes e incidentes
Procedimento de análise de acidentes e incidentes
 
Investigação e analise de acidentes
Investigação e analise de acidentesInvestigação e analise de acidentes
Investigação e analise de acidentes
 
Análise acidentes
Análise acidentes Análise acidentes
Análise acidentes
 

Similar to Galaxy of bits

Four Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics SystemFour Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics SystemTreasure Data, Inc.
 
How Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIHow Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIDenny Lee
 
Klout changing landscape of social media
Klout changing landscape of social mediaKlout changing landscape of social media
Klout changing landscape of social mediaDataWorks Summit
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time ApplicationsDataWorks Summit
 
Building Big Data Applications
Building Big Data ApplicationsBuilding Big Data Applications
Building Big Data ApplicationsRichard McDougall
 
Hadoop For Enterprises
Hadoop For EnterprisesHadoop For Enterprises
Hadoop For Enterprisesnvvrajesh
 
Leonard Austin (Ravelin) - DevOps in a Machine Learning World
Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldLeonard Austin (Ravelin) - DevOps in a Machine Learning World
Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldOutlyer
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Milos Milovanovic
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Darko Marjanovic
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascadingDataiku
 
Cetas Analytics as a Service for Predictive Analytics
Cetas Analytics as a Service for Predictive AnalyticsCetas Analytics as a Service for Predictive Analytics
Cetas Analytics as a Service for Predictive AnalyticsJ. David Morris
 

Similar to Galaxy of bits (20)

Four Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics SystemFour Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
 
How Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIHow Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BI
 
Klout changing landscape of social media
Klout changing landscape of social mediaKlout changing landscape of social media
Klout changing landscape of social media
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 
Steve Watt Presentation
Steve Watt PresentationSteve Watt Presentation
Steve Watt Presentation
 
Zh tw cloud computing era
Zh tw cloud computing eraZh tw cloud computing era
Zh tw cloud computing era
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Building Big Data Applications
Building Big Data ApplicationsBuilding Big Data Applications
Building Big Data Applications
 
Hadoop For Enterprises
Hadoop For EnterprisesHadoop For Enterprises
Hadoop For Enterprises
 
Leonard Austin (Ravelin) - DevOps in a Machine Learning World
Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldLeonard Austin (Ravelin) - DevOps in a Machine Learning World
Leonard Austin (Ravelin) - DevOps in a Machine Learning World
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
Big data
Big dataBig data
Big data
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascading
 
Cetas Analytics as a Service for Predictive Analytics
Cetas Analytics as a Service for Predictive AnalyticsCetas Analytics as a Service for Predictive Analytics
Cetas Analytics as a Service for Predictive Analytics
 
Cetas Predictive Analytics Prezo
Cetas Predictive Analytics PrezoCetas Predictive Analytics Prezo
Cetas Predictive Analytics Prezo
 

More from Michal Zylinski

iFIN24 – nowe spojrzenie na e-dokumenty
iFIN24 – nowe spojrzenie na e-dokumentyiFIN24 – nowe spojrzenie na e-dokumenty
iFIN24 – nowe spojrzenie na e-dokumentyMichal Zylinski
 
Dlaczego startupy potrzebują doradców? Wrażenia z Seedcamp 2009.
Dlaczego startupy potrzebują doradców? Wrażenia z Seedcamp 2009.Dlaczego startupy potrzebują doradców? Wrażenia z Seedcamp 2009.
Dlaczego startupy potrzebują doradców? Wrażenia z Seedcamp 2009.Michal Zylinski
 
Dlaczego startupy powinny dbać o wizerunek?
Dlaczego startupy powinny dbać o wizerunek?Dlaczego startupy powinny dbać o wizerunek?
Dlaczego startupy powinny dbać o wizerunek?Michal Zylinski
 
Inicjatywa Doradztwa Europejskiego
Inicjatywa Doradztwa EuropejskiegoInicjatywa Doradztwa Europejskiego
Inicjatywa Doradztwa EuropejskiegoMichal Zylinski
 
Zdobywanie serca klientów
Zdobywanie serca klientówZdobywanie serca klientów
Zdobywanie serca klientówMichal Zylinski
 
Twój własny kawałek YouTube
Twój własny kawałek YouTubeTwój własny kawałek YouTube
Twój własny kawałek YouTubeMichal Zylinski
 
Silverlight z bliska i na wylot
Silverlight z bliska i na wylotSilverlight z bliska i na wylot
Silverlight z bliska i na wylotMichal Zylinski
 
Nowości W Silverlight 3
Nowości W Silverlight 3Nowości W Silverlight 3
Nowości W Silverlight 3Michal Zylinski
 
Microsoft-Certyfikacja Aplikacji
Microsoft-Certyfikacja AplikacjiMicrosoft-Certyfikacja Aplikacji
Microsoft-Certyfikacja AplikacjiMichal Zylinski
 

More from Michal Zylinski (16)

Python i Microsoft
Python i MicrosoftPython i Microsoft
Python i Microsoft
 
PHP i microsoft
PHP i microsoftPHP i microsoft
PHP i microsoft
 
iFIN24 – nowe spojrzenie na e-dokumenty
iFIN24 – nowe spojrzenie na e-dokumentyiFIN24 – nowe spojrzenie na e-dokumenty
iFIN24 – nowe spojrzenie na e-dokumenty
 
LuceoS
LuceoSLuceoS
LuceoS
 
Domisoft
DomisoftDomisoft
Domisoft
 
User-centered design
User-centered designUser-centered design
User-centered design
 
Dlaczego startupy potrzebują doradców? Wrażenia z Seedcamp 2009.
Dlaczego startupy potrzebują doradców? Wrażenia z Seedcamp 2009.Dlaczego startupy potrzebują doradców? Wrażenia z Seedcamp 2009.
Dlaczego startupy potrzebują doradców? Wrażenia z Seedcamp 2009.
 
Dlaczego startupy powinny dbać o wizerunek?
Dlaczego startupy powinny dbać o wizerunek?Dlaczego startupy powinny dbać o wizerunek?
Dlaczego startupy powinny dbać o wizerunek?
 
Biz Spark i co dalej
Biz Spark i co dalejBiz Spark i co dalej
Biz Spark i co dalej
 
Inicjatywa Doradztwa Europejskiego
Inicjatywa Doradztwa EuropejskiegoInicjatywa Doradztwa Europejskiego
Inicjatywa Doradztwa Europejskiego
 
Zdobywanie serca klientów
Zdobywanie serca klientówZdobywanie serca klientów
Zdobywanie serca klientów
 
Twój własny kawałek YouTube
Twój własny kawałek YouTubeTwój własny kawałek YouTube
Twój własny kawałek YouTube
 
Iron Python I Dlr
Iron Python I DlrIron Python I Dlr
Iron Python I Dlr
 
Silverlight z bliska i na wylot
Silverlight z bliska i na wylotSilverlight z bliska i na wylot
Silverlight z bliska i na wylot
 
Nowości W Silverlight 3
Nowości W Silverlight 3Nowości W Silverlight 3
Nowości W Silverlight 3
 
Microsoft-Certyfikacja Aplikacji
Microsoft-Certyfikacja AplikacjiMicrosoft-Certyfikacja Aplikacji
Microsoft-Certyfikacja Aplikacji
 

Recently uploaded

TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptxFIDO Alliance
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuidePixlogix Infotech
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
How to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in PakistanHow to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in Pakistandanishmna97
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch TuesdayIvanti
 
Galaxy of bits

  • 1. Galaxy of bits Surviving the flood of information Michał Żyliński, Microsoft (michal.zylinski@microsoft.com)
  • 2. In 2000 the Sloan Digital Sky Survey collected more data in its first week than had been collected in the entire history of astronomy. By 2016 the new Large Synoptic Survey Telescope in Chile will acquire 140 terabytes in 5 days, more than Sloan acquired in 10 years. The Large Hadron Collider at CERN generates 40 terabytes of data every second. Sources: The Economist, Feb ‘10; IDC
  • 3. Bing ingests more than 7 petabytes a month. The Twitter community generates over 1 terabyte of tweets every day. Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes. Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
  • 4. The size of the Digital Universe in 2011: 1.8 ZB (1,800,000,000,000,000,000,000 bytes). [Chart: size of the Digital Universe in ZB for 2010, 2011, 2012 and 2015.] Within 24 months the number of intelligent devices will exceed the number of traditional IT devices. In 2015 nearly 20% of the information will be touched by cloud. Sources: IDC Digital Universe Study 2011, Worldwide Big Data Technology and Services 2012–2015 Forecast
  • 5. But... how real is it?
  • 6. Financial Services: Modeling True Risk; Threat Analysis; Fraud Detection; Trade Surveillance; Credit scoring and analysis. Retail: Point of Sales Transaction Analysis; Customer Churn Analysis; Sentiment Analysis. Telecommunications: Customer Churn Prevention; Network Performance optimization; Call Detail Record (CDR) Analysis; Analyzing Network to Predict Failure. E-Commerce: Recommendation Engines; Ad Targeting; Search Quality; Abuse and click fraud detection.
  • 7. A day in the life of a typical e-commerce site
  • 9. So how does it work? FIRST, STORE THE DATA
  • 10. So how does it work? SECOND, TAKE THE PROCESSING TO THE DATA
        // Map Reduce function in JavaScript
        var map = function (key, value, context) {
          var words = value.split(/[^a-zA-Z]/);
          for (var i = 0; i < words.length; i++) {
            if (words[i] !== "") { context.write(words[i].toLowerCase(), 1); }
          }
        };
        var reduce = function (key, values, context) {
          var sum = 0;
          while (values.hasNext()) { sum += parseInt(values.next()); }
          context.write(key, sum);
        };
  • 11. Hadoop in detail. Analysis of semi-structured and unstructured data distributed across a commodity cluster. Based on Google’s MapReduce paper and the Google File System (GFS). Programs = sequence of “map” and “reduce” tasks. Simplifies writing distributed applications. Highly fault tolerant (multiple copies). Moves computation close to the data. Implemented in Java and optimized for Linux.
  • 12. HDFS
  • 13. VS
  • 14. Traditional RDBMS vs. MapReduce
        Attribute  | Traditional RDBMS         | MapReduce
        Data Size  | Gigabytes (Terabytes)     | Petabytes (Exabytes)
        Access     | Interactive and Batch     | Batch
        Updates    | Read / Write many times   | Write once, Read many times
        Structure  | Static Schema             | Dynamic Schema
        Integrity  | High (ACID)               | Low
        Scaling    | Nonlinear                 | Linear
        DBA Ratio  | 1:40                      | 1:3000
  • 15. Hadoop Ecosystem. Hadoop = MapReduce + HDFS. Core: HDFS (Hadoop Distributed File System); MapReduce (Job Scheduling/Execution System); HBase (Column DB). Across the stack: Zookeeper (Coordination); Avro (Serialization). On top: Pig (Data Flow); Hive (Warehouse and Data Access); Apache Mahout; Karmasphere (Development Tool); Oozie (Workflow); Flume; Sqoop; HBase / Cassandra (Columnar NoSQL Databases); Traditional BI Tools.
  • 16. Hadoop + Microsoft. Our own distribution of Hadoop: submit changes back to the Apache Foundation; download for free. Optimized for Windows & Azure: AD & System Center integration; Hadoop-as-a-service on Azure. Focus on .NET developers: integration with Visual Studio; support for C#. Performance and Scale • High Availability • Ease of use
  • 17. Why Hadoop as a Service? Cheap and fast. • Task-based billing • Easy admin • Zero install • Support for a wide variety of job types: Machine Learning (Mahout), Graph Mining (Pegasus), Hive, Pig, Java, JS, etc. • Greatly simplified UI
  • 18.
  • 20. UNIX Pipes: cat [input_file] | [mapper] | sort | [reducer] > [output_file]
        Hadoop Streaming: hadoop jar libhadoop-streaming.jar -input directory -output directory -mapper any script or executable -reducer any script or executable
  • 23. PIG
  • 26. Benefits Key Features Data Market integration
  • 27. Some other fancy stuff... Microsoft Codename "Social Analytics". Benefits: models augmented with publicly available data from social media sites. Key Features
  • 29. Reality check A.D. 2012. ANALYTICS: self-service, mobile, operational, real-time, predictive, collaborative. DATA ENRICHMENT: marketplace, external data and services; discover and recommend, transform and clean, share and govern. DATA MANAGEMENT: relational, non-relational, multidimensional, streaming.
  • 30. Use Case: • Extremely large volume of unstructured web log analysis • Ad hoc analysis of unstructured web logs to prototype patterns • Hadoop data feeds large 24 TB Cube. [Diagram: Hadoop Distribution, 24 TB Cube, Microsoft BI Tools]
  • 31. 3
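
The word-count map and reduce functions from slide 10 can be tried without a cluster. The following is a minimal sketch, assuming a hypothetical local driver (`runLocally` and `makeIterator` are illustrative names, not part of Hadoop or of the Microsoft distribution). It mimics the `mapper | sort | reducer` pipeline from slide 20: map every line, sort the emitted pairs by key, then hand each group of equal keys to reduce.

```javascript
// Map and reduce functions as shown on slide 10 (word count).
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical iterator exposing the hasNext()/next() shape that reduce expects.
function makeIterator(items) {
  var pos = 0;
  return {
    hasNext: function () { return pos < items.length; },
    next: function () { return items[pos++]; }
  };
}

// Hypothetical local driver (an assumption for illustration, not a Hadoop API).
function runLocally(lines) {
  // Map phase: collect every (key, value) pair the mapper writes.
  var pairs = [];
  var mapContext = { write: function (k, v) { pairs.push([k, v]); } };
  lines.forEach(function (line, offset) { map(offset, line, mapContext); });

  // Sort phase: the equivalent of `sort` between mapper and reducer.
  pairs.sort(function (a, b) {
    return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
  });

  // Reduce phase: call reduce once per run of identical keys.
  var result = {};
  var reduceContext = { write: function (k, v) { result[k] = v; } };
  var i = 0;
  while (i < pairs.length) {
    var key = pairs[i][0];
    var group = [];
    while (i < pairs.length && pairs[i][0] === key) {
      group.push(pairs[i][1]);
      i++;
    }
    reduce(key, makeIterator(group), reduceContext);
  }
  return result;
}

var counts = runLocally(["Galaxy of bits", "bits and more bits"]);
console.log(counts); // { and: 1, bits: 3, galaxy: 1, more: 1, of: 1 }
```

On a real cluster the sort and grouping are done by the framework across machines; the driver here only reproduces that contract in a single process so the two functions can be checked on small inputs.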

Editor's Notes

  1. Share and collaborate via Windows Azure Marketplace: The Microsoft Big Data solution enables customers to share data and insights through Windows Azure Marketplace, which exposes hundreds of applications and data mining algorithms from Microsoft and third parties to help unlock unprecedented insights for customers. Microsoft’s Hadoop-based service for Windows Azure offers seamless connection to Azure Marketplace through the Open Data (OData) Protocol.
  2. Integrate with social media: Microsoft’s Big Data solution enables customers to augment their analysis with publicly available data from social media sites (such as Twitter and Facebook) and hundreds of trusted data providers on Windows Azure Marketplace. Microsoft Codename "Social Analytics" allows for integration of social information with business applications.