Big Data
                          Steve Watt   Emerging Technologies @ HP

     1
– Someday Soon (Flickr)
2
– timsnell (Flickr)
Agenda

    Hardware   Software   Data

•    Big Data
•    Situational Applications




3
Situational Applications




      4
– eaghra (Flickr)
Web 2.0 Era Topic Map
[Diagram: topic map with Produce and Process columns]
Topics: Data Explosion, Inexpensive Storage, LAMP, Social Platforms, Publishing Platforms, Web 2.0, Mashups, Situational Applications, Enterprise, SOA
5
6
Big Data




      7
– blmiers2 (Flickr)
The data just keeps growing…

 1024 GIGABYTES = 1 TERABYTE
       1024 TERABYTES = 1 PETABYTE
             1024 PETABYTES = 1 EXABYTE


1 PETABYTE: 13.3 years of HD video

20 PETABYTES: amount of data processed by Google daily

5 EXABYTES: all words ever spoken by humanity
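The unit ladder above is plain arithmetic in powers of 1024. A minimal Python sketch of that conversion follows; the names `UNITS`, `to_bytes` and `convert` are illustrative assumptions, not something from the talk.

```python
# Binary unit ladder as used on the slide: each step up is a factor of 1024.
UNITS = ["BYTE", "KILOBYTE", "MEGABYTE", "GIGABYTE", "TERABYTE", "PETABYTE", "EXABYTE"]


def to_bytes(value: float, unit: str) -> float:
    """Convert a quantity in the given unit to bytes (1024-based)."""
    exponent = UNITS.index(unit.upper())
    return value * 1024 ** exponent


def convert(value: float, src: str, dst: str) -> float:
    """Convert a quantity between any two units listed in UNITS."""
    return to_bytes(value, src) / to_bytes(1, dst)


if __name__ == "__main__":
    # The daily volume quoted on the slide, expressed in terabytes.
    print(convert(20, "PETABYTE", "TERABYTE"))
```

Calling `convert(1024, "GIGABYTE", "TERABYTE")` reproduces the first line of the ladder.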
Mobile: App Economy for Devices (App for this, App for that)
•    Set Top Boxes, Tablets, etc.

Sensor Web: An instrumented and monitored world
•    Multiple Sensors in your pocket
•    Real-time Data

The Fractured Web: Service Economy (Service for this, Service for that)
•    Facebook, Twitter, LinkedIn
•    Google, NetFlix, New York Times
•    eBay, Pandora, PayPal

Opportunity: Web 2.0 Data Exhaust of Historical and Real-time Data

Web as a Platform
•    Web 2.0 - Connecting People (API Foundation)
•    Web 1.0 - Connecting Machines (Infrastructure)
9
Data Deluge! But filter patterns can help…
    10
Kakadu (Flickr)
Filtering
With
Search




 11
Filtering
Socially




            Awesome
 12
Filtering
Visually




 13
But filter patterns force you down a pre-processed path
M.V. Jantzen (Flickr)
What if you could ask your own questions?

     15
– wowwzers (Flickr)
And go from discovering Something about Everything…

– MrB-MMX (Flickr)
To discovering Everything about Something ?

17
How do we do this?
 Let's examine a few techniques for
Gathering,
     Storing,
         Processing &
                Delivering Data @ Scale

18
Gathering Data

Data Marketplaces




 19
20
21
Gathering Data

Apache Nutch
(Web Crawler)




 22
Storing, Reading and Processing - Apache Hadoop
•    Cluster technology with a single master that scales out across multiple slaves
•    It consists of two runtimes:
     •   The Hadoop Distributed File System (HDFS)
     •   Map/Reduce

•    As data is copied onto HDFS, it is split into blocks and replicated to other machines to provide redundancy
•    A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop master, which in turn distributes the job's tasks to the slaves in the cluster
•    Tasks run on data that is on the local disks of the machine they are sent to, ensuring data locality
•    Node (slave) failures are handled automatically by Hadoop, which may execute or re-execute a task on any node in the cluster
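
The Map/Reduce flow described above can be sketched in a single Python process. This is only an illustration of the programming model (map, group by key, reduce), not Hadoop's API. The word-count job and the names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are assumptions made for the example, and each inner list stands in for one block of input that a slave would process locally.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one key."""
    return key, sum(values)


def run_job(blocks: List[List[str]]) -> Dict[str, int]:
    """Map every line of every block, group the pairs by key, then reduce."""
    mapped = [pair for block in blocks for line in block for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job([["big data", "data deluge"], ["big big"]]))
```

Because the mapper only ever sees one record at a time, the same function can run on whichever machine holds the block, which is what makes the data-locality point above possible.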

     Want to know more?
     “Hadoop: The Definitive Guide (2nd Edition)”
23
Delivering Data @ Scale

•    Structured Data
•    Low Latency & Random Access
•    Column Stores (Apache HBase or Apache Cassandra)
     •   faster seeks
     •   better compression
     •   simpler scale out
     •   De-normalized – Data is written as it is intended to be queried
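
To make "written as it is intended to be queried" concrete, here is a minimal sketch that uses a plain Python dictionary as a stand-in for a wide-row table. It is not the HBase or Cassandra API. The row-key choice (city), the `company:` column prefix, the helper names and the sample companies are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical in-memory stand-in for a wide-row table:
# row key -> column name -> value.
Table = Dict[str, Dict[str, str]]


def write_investment(table: Table, city: str, company: str, amount: str) -> None:
    """Write the record under the key it will later be read by (the city)."""
    table[city][f"company:{company}"] = amount


def companies_in(table: Table, city: str) -> List[str]:
    """Answer 'which companies are in this city?' with a single key lookup."""
    return sorted(column.split(":", 1)[1] for column in table.get(city, {}))


if __name__ == "__main__":
    table: Table = defaultdict(dict)
    write_investment(table, "Austin", "Acme", "1000000")
    write_investment(table, "Austin", "Globex", "2500000")
    write_investment(table, "Boston", "Initech", "750000")
    print(companies_in(table, "Austin"))
```

The trade-off in this sketch is that a second query pattern (for example, lookups by company) would need the same data written again under a different key.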




         Want to know more?
         “HBase: The Definitive Guide” & “Cassandra High Performance”
24
Storing, Processing & Delivering : Hadoop + NoSQL

Stages: Gather, Read/Transform, Low-latency, Application

Gather:
•    Web Data: Nutch Crawl, Copy
•    Log Files: Flume Connector
•    Relational Data (JDBC), e.g. MySQL: SQOOP Connector

Read/Transform (Apache Hadoop, HDFS):
•    Clean and Filter Data
•    Transform and Enrich Data
•    Often multiple Hadoop jobs

Low-latency: NoSQL Connector/API into a NoSQL Repository

Application: Query, Serve
25
Some things to keep in mind…




     26
– Kanaka Menehune (Flickr)
Some things to keep in mind…

•    Processing arbitrary types of data (unstructured, semi-structured, structured) requires normalizing the data with many different kinds of readers.
     Hadoop is really great at this!
•    However, readers won’t really help you process truly unstructured data such as prose. For that you’re going to have to get handy with Natural Language Processing, which is really hard.
     Consider using parsing services & APIs like Open Calais.

     Want to know more?
     “Programming Pig” (O’Reilly)
27
Open Calais (Gnosis)




28
Statistical real-time decision making

      Capture Historical information

      Use Machine Learning to build decision making models (such as
       Classification, Clustering & Recommendation)

      Mesh real-time events (such as sensor data) against Models to make
       automated decisions
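
The three steps above (capture history, build a model, score live events against it) can be sketched with a deliberately simple classifier. The nearest-centroid method, the function names `train` and `decide`, and the sensor readings below are assumptions chosen for brevity. The talk points to Mahout for real workloads, and this sketch does not use it.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Sequence, Tuple

Model = Dict[str, Tuple[float, ...]]


def train(history: List[Tuple[Sequence[float], str]]) -> Model:
    """Build a model from historical data: one centroid (mean vector) per label."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for features, label in history:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def decide(model: Model, event: Sequence[float]) -> str:
    """Classify a live event by the closest centroid (Euclidean distance)."""
    return min(model, key=lambda label: dist(model[label], event))


if __name__ == "__main__":
    # Made-up historical sensor readings: (temperature, humidity) -> label.
    history = [
        ((20.0, 30.0), "normal"),
        ((22.0, 34.0), "normal"),
        ((80.0, 90.0), "alert"),
        ((90.0, 94.0), "alert"),
    ]
    model = train(history)
    print(decide(model, (78.0, 88.0)))
```

Training happens offline over the full history, while `decide` is cheap enough to call once per incoming event.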




     Want to know more?
     “Mahout in Action”
29
30
Pascal Terjan (Flickr)
31
32
Using Apache
Identify Optimal Seed URLs for a Seed List & Crawl to a depth of 2

For example:

http://www.crunchbase.com/companies?c=a&q=private_held
http://www.crunchbase.com/companies?c=b&q=private_held
http://www.crunchbase.com/companies?c=c&q=private_held
http://www.crunchbase.com/companies?c=d&q=private_held
...

Crawl data is stored in sequence files in the segments dir on the HDFS
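
A seed list like the one above can be generated rather than typed by hand. The sketch below assumes the pattern continues with one URL per letter through the rest of the alphabet, which the slide elides, so treat that range as a guess. `seed_urls` and `seed_file_text` are hypothetical helper names. The output is a plain text file with one URL per line.

```python
from string import ascii_lowercase
from typing import List

BASE = "http://www.crunchbase.com/companies"


def seed_urls(letters: str = ascii_lowercase) -> List[str]:
    """Build one listing URL per letter, following the pattern on the slide."""
    return [f"{BASE}?c={letter}&q=private_held" for letter in letters]


def seed_file_text(urls: List[str]) -> str:
    """Render the seed list as text, one URL per line."""
    return "\n".join(urls) + "\n"


if __name__ == "__main__":
    print(seed_file_text(seed_urls("abcd")), end="")
```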
 33
34
Making the data STRUCTURED




          Retrieving HTML

                Prelim Filtering on URL


          Company POJO, then tab-delimited (\t) output
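
The three steps on this slide (retrieve the HTML, filter on URL, populate a company object and write it out tab-delimited) might look roughly like the sketch below. The deck's original used a Java POJO, so this Python dataclass is a stand-in. The URL pattern, the HTML markup, the regular expressions and the field list are all hypothetical, and real pages would need their own extraction rules.

```python
import re
from dataclasses import astuple, dataclass
from typing import Optional


@dataclass
class Company:
    """Plain record object, playing the role of the Company POJO."""
    name: str
    city: str
    state: str

    def to_tsv(self) -> str:
        """Serialize the record as one tab-delimited line."""
        return "\t".join(astuple(self))


def is_company_page(url: str) -> bool:
    """Preliminary filter: keep only URLs that look like company profiles."""
    return "/company/" in url


# Hypothetical markup for illustration only.
NAME = re.compile(r"<h1>(.*?)</h1>")
LOCATION = re.compile(r'<span class="location">(.*?), (\w{2})</span>')


def extract(url: str, html: str) -> Optional[Company]:
    """Return a Company for a matching page, or None if filtered out or unparsable."""
    if not is_company_page(url):
        return None
    name, location = NAME.search(html), LOCATION.search(html)
    if not (name and location):
        return None
    return Company(name.group(1), location.group(1), location.group(2))


if __name__ == "__main__":
    page = '<h1>Acme</h1><span class="location">Austin, TX</span>'
    company = extract("http://example.com/company/acme", page)
    if company is not None:
        print(company.to_tsv())
```

Filtering on the URL before parsing keeps the expensive step off pages that cannot yield a company record.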




35
Aargh!

My viz tool requires zipcodes to plot geospatially!


  36
Apache Pig Script to Join on City and State to get Zip Code and Write the Results to Vertica

-- Load the tab-delimited zip code lookup table
ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS (State:chararray, City:chararray, ZipCode:int);

-- Load the tab-delimited CrunchBase investment records
CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS
    (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);

-- Inner join on (City, State) to attach a zip code to each investment
CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);

-- Write the joined records to the CrunchBaseZip table in Vertica
STORE CrunchBaseZip INTO
    '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}'
    USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
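
For readers who do not know Pig, the JOIN in the script behaves like an inner hash join keyed on (City, State). The Python sketch below shows that logic on in-memory tuples. It is an approximation: Pig's JOIN emits every field of both relations, whereas this version appends only the zip code. The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

ZipRow = Tuple[str, str, int]   # (State, City, ZipCode), as in zipcodes.txt
Row = Tuple                     # investment row; City and State at positions 1 and 2


def join_on_city_state(crunchbase: Iterable[Row], zipcodes: Iterable[ZipRow]) -> Iterator[Row]:
    """Inner join: pair each investment with every zip code for its (City, State)."""
    index: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for state, city, zipcode in zipcodes:
        index[(city, state)].append(zipcode)
    for row in crunchbase:
        city, state = row[1], row[2]
        for zipcode in index.get((city, state), []):
            yield row + (zipcode,)


if __name__ == "__main__":
    zips = [("TX", "Austin", 78701), ("TX", "Austin", 78702), ("MA", "Boston", 2108)]
    deals = [("Acme", "Austin", "TX", "Software"), ("Initech", "Denver", "CO", "Software")]
    for joined in join_on_city_state(deals, zips):
        print(joined)
```

Two behaviours are worth noticing in this sketch: a city with several zip codes yields several output rows per investment, and investments whose city has no match are dropped.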
Total Tech Investments By Year
Investment Funding By Sector
Total Investments By Zip Code for all Sectors

                                                                     $1.2 Billion in Boston



     $7.3 Billion in San Francisco


            $2.9 Billion in Mountain View




                                            $1.7 Billion in Austin

40
Total Investments By Zip Code for Consumer Web

        $600 Million in Seattle
                                       $1.2 Billion in Chicago


     $1.7 Billion in San Francisco




41
Total Investments By Zip Code for BioTech

                                            $1.3 Billion in Cambridge




                   $528 Million in Dallas




     $1.1 Billion in San Diego




42
Questions?

     Steve Watt swatt@hp.com

              @wattsteve

               stevewatt.blogspot.com

43

Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 

Steve Watt Presentation

  • 1. Big Data. Steve Watt, Emerging Technologies @ HP – Someday Soon (Flickr)
  • 3. Agenda: Hardware; Software (Situational Applications); Data (Big Data)
  • 4. Situational Applications – eaghra (Flickr)
  • 5. Web 2.0 Era Topic Map: Produce, Process, Inexpensive Storage, Data Explosion, LAMP, Social Platforms, Publishing Platforms, Situational Applications, Web 2.0, Mashups, Enterprise, SOA
  • 7. Big Data – blmiers2 (Flickr)
  • 8. The data just keeps growing…
      • 1024 GIGABYTES = 1 TERABYTE
      • 1024 TERABYTES = 1 PETABYTE
      • 1024 PETABYTES = 1 EXABYTE
      • 1 PETABYTE: 13.3 Years of HD Video
      • 20 PETABYTES: Amount of Data processed by Google daily
      • 5 EXABYTES: All words ever spoken by humanity
  • 9. (Diagram)
      • Mobile: App Economy for Devices (App for this, App for that); Set Top Boxes, Tablets, etc.
      • Sensor Web: An instrumented and monitored world; Multiple Sensors in your pocket; Real-time Data
      • The Fractured Web: Facebook, Twitter, LinkedIn, Google, NetFlix, New York Times, eBay, Pandora, PayPal
      • Service Economy: Service for this, Service for that
      • Opportunity
      • Web 2.0 Data Exhaust of Historical and Real-time Data
      • Web 2.0 - Connecting People: API Foundation
      • Web as a Platform
      • Web 1.0 - Connecting Machines: Infrastructure
  • 10. Data Deluge! But filter patterns can help… – Kakadu (Flickr)
  • 12. Filtering Socially Awesome
  • 14. But filter patterns force you down a pre-processed path – M.V. Jantzen (Flickr)
  • 15. What if you could ask your own questions? – wowwzers (Flickr)
  • 16. And go from discovering Something about Everything… – MrB-MMX (Flickr)
  • 17. To discovering Everything about Something?
  • 18. How do we do this? Let's examine a few techniques for Gathering, Storing, Processing & Delivering Data @ Scale
  • 23. Storing, Reading and Processing - Apache Hadoop
      • Cluster technology with a single master that scales out with multiple slaves
      • It consists of two runtimes:
          • The Hadoop Distributed File System (HDFS)
          • Map/Reduce
      • As data is copied onto HDFS, it is split into blocks and replicated to other machines to provide redundancy
      • A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop master, which in turn distributes the job to each slave in the cluster
      • Jobs run on data that is on the local disks of the machine they are sent to, ensuring data locality
      • Node (slave) failures are handled automatically by Hadoop. Hadoop may execute or re-execute a job on any node in the cluster.
      • Want to know more? “Hadoop – The Definitive Guide (2nd Edition)”
  • 24. Delivering Data @ Scale
      • Structured Data
      • Low Latency & Random Access
      • Column Stores (Apache HBase or Apache Cassandra)
          • faster seeks
          • better compression
          • simpler scale out
      • De-normalized – Data is written as it is intended to be queried
      • Want to know more? “HBase – The Definitive Guide” & “Cassandra High Performance
  • 25. Storing, Processing & Delivering: Hadoop + NoSQL (diagram)
      • Gather: Web Data (Nutch crawl), Log Files (Flume connector) and Relational Data such as MySQL (SQOOP connector, JDBC) are copied into Apache Hadoop HDFS
      • Read/Transform (Apache Hadoop): clean and filter data; transform and enrich data; often multiple Hadoop jobs
      • Low-latency: a NoSQL connector/API loads the results into a NoSQL repository, which the application queries to serve data
  • 26. Some things to keep in mind… – Kanaka Menehune (Flickr)
  • 27. Some things to keep in mind…
      • Processing arbitrary types of data (unstructured, semi-structured, structured) requires normalizing data with many different kinds of readers. Hadoop is really great at this!
      • However, readers won’t really help you process truly unstructured data such as prose. For that you’re going to have to get handy with Natural Language Processing. But this is really hard. Consider using parsing services & APIs like Open Calais.
      • Want to know more? “Programming Pig” (O’REILLY)
  • 29. Statistical real-time decision making
      • Capture Historical information
      • Use Machine Learning to build decision making models (such as Classification, Clustering & Recommendation)
      • Mesh real-time events (such as sensor data) against Models to make automated decisions
      • Want to know more? “Mahout in Action”
  • 33. Using Apache Nutch: Identify Optimal Seed URLs for a Seed List & Crawl to a depth of 2. For example:
      http://www.crunchbase.com/companies?c=a&q=private_held
      http://www.crunchbase.com/companies?c=b&q=private_held
      http://www.crunchbase.com/companies?c=c&q=private_held
      http://www.crunchbase.com/companies?c=d&q=private_held
      ...
      Crawl data is stored in sequence files in the segments dir on the HDFS
  • 35. Making the data STRUCTURED: Retrieving HTML, Prelim Filtering on URL, Company POJO, then \t Out
  • 36. Aargh! My viz tool requires zipcodes to plot geospatially!
  • 37. Apache Pig Script to Join on City to get Zip Code and Write the results to Vertica
      -- Load the tab-delimited zip code lookup and the CrunchBase extract
      ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS (State:chararray, City:chararray, ZipCode:int);
      CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
      -- Join on (City, State) to attach a zip code to each investment record
      CrunchBaseZip = JOIN CrunchBase BY (City,State), ZipCodes BY (City,State);
      -- Write the joined records to Vertica
      STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}' USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
  • 40. Total Investments By Zip Code for all Sectors: $1.2 Billion in Boston; $7.3 Billion in San Francisco; $2.9 Billion in Mountain View; $1.7 Billion in Austin
  • 41. Total Investments By Zip Code for Consumer Web: $600 Million in Seattle; $1.2 Billion in Chicago; $1.7 Billion in San Francisco
  • 42. Total Investments By Zip Code for BioTech: $1.3 Billion in Cambridge; $528 Million in Dallas; $1.1 Billion in San Diego
  • 43. Questions? Steve Watt, swatt@hp.com, @wattsteve, stevewatt.blogspot.com
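
The join on slide 37 and the per-zip-code totals on slides 40 to 42 can be sketched outside Pig as well. The following is a minimal in-memory Python illustration, not the talk's actual pipeline: the sample rows, company names and the function names `join_on_city_state` and `total_by_zip` are all hypothetical, and zip codes are kept as strings (an assumption, so that leading zeros survive).

```python
from collections import defaultdict

# Hypothetical sample rows; not taken from the talk's data set.
# (state, city, zipcode): zip codes are strings so leading zeros are preserved.
ZIPCODES = [
    ("MA", "Boston", "02108"),
    ("TX", "Austin", "78701"),
]

# (company, city, state, amount)
CRUNCHBASE = [
    ("ExampleCo A", "Boston", "MA", 5_000_000),
    ("ExampleCo B", "Boston", "MA", 2_000_000),
    ("ExampleCo C", "Austin", "TX", 1_000_000),
    ("ExampleCo D", "Nowhere", "ZZ", 9_000_000),  # no zip match: dropped by the inner join
]


def join_on_city_state(companies, zipcodes):
    """Inner join of company rows with zip code rows on (city, state)."""
    zips_by_key = defaultdict(list)
    for state, city, zipcode in zipcodes:
        zips_by_key[(city, state)].append(zipcode)

    joined = []
    for company, city, state, amount in companies:
        # One output row per matching zip code, as an inner join would produce.
        for zipcode in zips_by_key.get((city, state), []):
            joined.append((company, city, state, amount, zipcode))
    return joined


def total_by_zip(joined_rows):
    """Sum the investment amount for each zip code."""
    totals = defaultdict(int)
    for _company, _city, _state, amount, zipcode in joined_rows:
        totals[zipcode] += amount
    return dict(totals)


if __name__ == "__main__":
    rows = join_on_city_state(CRUNCHBASE, ZIPCODES)
    for zipcode, total in sorted(total_by_zip(rows).items()):
        print(zipcode, total)
```

One caveat of joining on city and state: a city with several zip codes yields one joined row per zip code, so totals would be counted once per zip. The sample data sidesteps this by using a single zip code per city.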

Editor's Notes

  1. As hardware becomes increasingly commoditized, the margin & differentiation moves to software; as software becomes increasingly commoditized, the margin & differentiation moves to data. 2000 - Cloud is an IT Sourcing Alternative (Virtualization extends into Cloud). Explosion of Unstructured Data. Mobile. “Let’s create a context in which to think….” Focused on 3 major tipping points in the evolution of the technology. Mention that this is a very web-centric view, contrasted with Barry Devlin’s Enterprise view. Assumes Networking falls under Hardware & Cloud is at the intersection of Software and Data. Why should you care? Tipping Point 1: Situational Applications. Tipping Point 2: Big Data. Tipping Point 3: Reasoning.
  2. Web 2.0, SOA and Mashups:
      • Web 2.0: Information explosion, now many channels, turning consumers into producers (Shirky). Tipping point: web standards allow Rapid Application Development. Advent of Situational Applications, folksonomies, social.
      • SOA: Functionality exposed through open interfaces and open standards. Great strides in modularity and re-use whilst reducing complexities around system integration. Still need to be a developer to create applications using these service interfaces (WSDL, SOAP, way too complex!). Enter mashups…
      • Mashups: Place a façade on the service and you have the final step in the evolution of services and service-based applications. Now anyone can build applications (i.e. non-programmers). We’ve taken the entire SOA library and exposed it to non-programmers. What do I mean? Check out this YouTunes app…
      • 1st example where we saw arbitrary data/content re-purposed in ways the original authors never intended, e.g. Craigslist/Gumtree homes for sale scraped and placed on a Google map, mashed up with crime statistics. Whole greater than the sum of its parts -> new kinds of information!!
      • BUT there are limitations around how much arbitrary data can be scraped and turned into info. Usually no pre-processing, and just what can be rendered on a single page.
      • Demo
  3. http://www.housingmaps.com/
  4. “Every 2 days we create as much data as we did from the dawn of humanity until 2003” – We’ve hit the Petabyte & Exabyte age. What does that mean? Let’s look (next slide).
  5. Mention Enterprise Growth over time, Mobile/Sensor Data, Web 2.0 Data Exhaust, Social Networks. Advances in Analytics – keep your data around for deeper business insights and to avoid Enterprise Amnesia.
  6. How about we summarize a few of the key trends in the Web as we know it today… This diagram shows some of the main trends of what Web 3.0 is about. Netflix accounts for 29.7% of US traffic. Mention Web 2.0 Summit Points of Control. Having more data leads to better context, which leads to deeper understanding/insight or new discoveries. Refer to Reid Hoffman’s views on what Web 3.0 is.
  7. Pre-processed though, not flexible, you can’t ask specific questions that have not been pre-processed
  8. Mention folksonomies in Web 2.0 with searching Delicious Bookmarks. Mention Chilean Earthquake Crisis Video using Twitter to do Crisis Mapping.
  9. Talk about Visualizations and InfoGraphics – manual and a lot of work
  10. They are only part of the solution & don’t allow you to ask your own questions
  11. This is the real promise of Big Data
  12. These are not all the problems around Big Data. These are the bigger problems around deriving new information out of web data. There are other issues as well, like inconsistency, skew, etc.
  13. Give a Nutch example
  14. Specifically call out the color coding reasoning for Map/Reduce and HDFS as a single distributed service
  15. Give examples of how one might use Open Calais or Entity Extraction libraries
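
For the Nutch example called for in note 13, a seed list in the style of slide 33 could be generated as follows. This is a hedged sketch: the slide only shows the letters a to d followed by an ellipsis, so iterating over every lowercase letter is an assumption, as are the function names `build_seed_urls` and `write_seed_list` and the output file name. The file is written as plain text with one URL per line, which is the shape a crawler seed list usually takes.

```python
import string
from urllib.parse import urlencode

# Base URL copied from slide 33.
BASE_URL = "http://www.crunchbase.com/companies"


def build_seed_urls(letters=string.ascii_lowercase):
    """Build one seed URL per letter, following the pattern shown on slide 33.

    Assumption: the slide elides the full range, so a-z is a guess.
    """
    return [
        f"{BASE_URL}?{urlencode({'c': letter, 'q': 'private_held'})}"
        for letter in letters
    ]


def write_seed_list(path, urls):
    """Write the seed list as plain text, one URL per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")


if __name__ == "__main__":
    seeds = build_seed_urls()
    write_seed_list("seed.txt", seeds)  # hypothetical file name
    print(f"wrote {len(seeds)} seed URLs")
```

Keeping URL construction separate from file writing makes it easy to check the generated list before handing it to a crawler.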