Hadoop as a Data Refinery

Steve Loughran – Hortonworks
@steveloughran
London, October 2012




© Hortonworks Inc. 2012
About me:
• HP Labs:
  – Deployment, cloud infrastructure, Hadoop-in-Cloud
• Apache – member and committer
  – Ant, Axis; author of Ant in Action
  – Hadoop:
    – Dynamic deployments
    – Diagnostics on failures
    – Cloud infrastructure integration
• Joined Hortonworks in 2012
  – UK based: R&D



What is Apache Hadoop?

• Collection of Open Source Projects
  – Apache Software Foundation (ASF)
  – Commercial and community development
  – One of the best examples of open source driving innovation and creating a market

• Foundation for Big Data Solutions
  – Stores petabytes of data reliably
  – Runs highly distributed computation
  – Commodity servers & storage
  – Powers data-driven business
Why Hadoop?

Business Pressure
1. Opportunity to enable innovative new business models
2. Potential new insights that drive competitive advantage

Technical Pressure
3. Data collected and stored continues to grow exponentially
4. Data is increasingly everywhere and in many formats
5. Traditional solutions not designed for new requirements

Financial Pressure
6. Cost of data systems, as % of IT spend, continues to grow
7. Cost advantages of commodity hardware & open source
The data refinery in an enterprise

[Diagram: new data sources – audio, video, images; docs, text, XML; web logs and clicks; social, graph and feed data; sensor, device and RFID data; spatial/GPS data; events and more – flow into a Big Data Refinery built on Apache Hadoop (HDFS, Pig). The refinery exchanges data, via ETL, with the systems holding business transactions & interactions (web, mobile, CRM, ERP, SCM; SQL, NoSQL and NewSQL stores) and with business intelligence & analytics platforms (EDW, MPP, NewSQL), which drive dashboards, reports and visualization.]
Modernising Business Intelligence
• Before:
  – Current records & short history only
  – Analytics/BI systems keep conformed / cleaned / digested data
  – Unstructured data locked in silos, archived offline
  Inflexible: new questions require system redesigns

• Now:
  – Keep raw data in Hadoop for a long time
  – Reprocess/enhance analytics/BI data on demand
  – Experiment directly on all the raw data
  – New products / services can be added very quickly
  Storage and agility justify the new infrastructure
Refineries pull in raw data
Internal: pipelines with Apache Flume
  – Web site logs
  – Real-world events: retail, financial, vehicle movements
  – New data sources you create
  The data you couldn't afford to keep

External: pipelines and bulk deliveries
  – Correlating data: weather, market, competition
  – New sources: Twitter feeds, Infochimps, open government data
  – Real-world events: retail, financial
  – Apache Sqoop for bulk transfer from databases
  To help understand your own data
Refineries refine raw data
• Clean up raw data
• Filter “cleaned” data

• Forward data to different destinations:
  – Existing BI infrastructure
  – New “Agile Data” infrastructures


• Offload work from the core Data Warehouse
  – ETL operations
  – Report and Chart Generation
  – Ad-hoc queries


      Needs: query, workflow and reporting tools
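The clean-and-filter step can be as small as a streaming script. A minimal sketch in Python, in the style of a Hadoop Streaming mapper; the three-field log layout (timestamp, URL, status) is an illustrative assumption, not a prescribed format:

```python
# Clean-and-filter pass over raw web-server log lines, Hadoop Streaming
# style: read lines from stdin, emit tab-separated cleaned records to stdout.
# The (timestamp, url, status) layout is an illustrative assumption.
import sys

def clean(line):
    """Return a (timestamp, url, status) tuple, or None for junk lines."""
    parts = line.strip().split()
    if len(parts) < 3:
        return None              # malformed record: drop it
    timestamp, url, status = parts[0], parts[1], parts[2]
    if not status.isdigit():
        return None              # corrupt status field: drop it
    return timestamp, url, int(status)

def run(lines, out=sys.stdout):
    for line in lines:
        record = clean(line)
        if record is not None:
            out.write("%s\t%s\t%d\n" % record)

if __name__ == "__main__":
    run(sys.stdin)
```

Run under `hadoop jar hadoop-streaming.jar -mapper ...` or, for a first look, simply pipe a sample file through it on one machine.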
Refineries can store data
• Retain historical transaction data, analyses
• Store (cleaned, filtered, compressed) raw data
• Provide the history for more advanced analysis in
  future applications and queries

• Needs: storage, query tools
  – Storage: HDFS and HBase
  – Languages: Pig & Hive
  – Workflow for scheduled jobs: Oozie
  – Shared schema repository: HCatalog



Hadoop makes storing bulk & historical data affordable
What if I didn't have a Data Warehouse?
Congratulations!

1. HBase: scale, Hadoop integration
2. MongoDB, CouchDB, Riak: good for web UIs
3. Postgres, MySQL, …: transactions
Agile Data
• SQL experts: Hive HQL queries
• Ad-hoc queries: Pig
• Statistics platform: R + Hadoop
• Visualisation tools – including Excel
• New web UI applications

Because you don’t know all that you are looking for
when you collect the data
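In that spirit, ad-hoc exploration is often just a quick aggregation over a field nobody planned to analyse when the data was collected. A minimal sketch in Python (the dict-shaped records and the `ua` field name are illustrative assumptions; in practice the rows might come out of HDFS via Pig or Hive):

```python
# Ad-hoc exploration: count the most common values of a field you didn't
# plan to analyse when the data was collected. Records are plain dicts
# here purely for illustration.
from collections import Counter

def top_values(records, field, k=3):
    """Most common values of `field` across the records that carry it."""
    counts = Counter(r[field] for r in records if field in r)
    return counts.most_common(k)
```

The point is the turnaround time: a question like "which user agents dominate the raw logs?" becomes a few lines, not a schema change.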
Pig: an Agile Data language
• Optimised for refining data
• Dataflow-driven – much higher level than Java
• Macros and User Defined Functions
• ILLUSTRATE aids development by showing sample data at each step
• For ad-hoc and production use
Example: Packetpig
-- Load snort alerts from a packet capture, geolocate each alert's source
-- address, then compute the average alert severity per country.
snort_alerts = LOAD '$pcap'
  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts
  GENERATE
    com.packetloop.packetpig.udf.geoip.Country(src) AS country,
    priority;

countries = GROUP countries BY country;

countries = FOREACH countries
  GENERATE
    group,
    AVG(countries.priority) AS average_severity;

STORE countries INTO 'output/choropleth_countries' USING PigStorage(',');
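For readers new to Pig, the same dataflow can be sketched in plain Python: group alert records by source country, then average the alert priority per country. The `(country, priority)` pairs stand in for what the GeoIP UDF produces in the Pig script:

```python
# The Packetpig dataflow in plain Python: GROUP BY country, then
# AVG(priority) per group. Input records are (country, priority) pairs.
from collections import defaultdict

def average_severity(alerts):
    """alerts: iterable of (country, priority); returns {country: avg}."""
    grouped = defaultdict(list)           # GROUP countries BY country
    for country, priority in alerts:
        grouped[country].append(priority)
    return {c: sum(ps) / len(ps)          # FOREACH ... GENERATE AVG(...)
            for c, ps in grouped.items()}
```

Pig generates the equivalent MapReduce jobs for you, so the same three-line dataflow scales from a sample file to a full packet archive.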


web UI: d3.js




Analytics Apps: It takes a Team
• Broad skill-set to make useful apps
• Basically nobody has them all
• Application development is inherently collaborative




Developers: learn statistics via Pig

Data Scientists & Statisticians:
learn Pig (and R)


Russ Jurney @ HUG UK in November
meetup.com/hadoop-users-group-uk/
Challenge:
Becoming a data-driven organisation




Challenges
• Thinking of the right questions to ask

• Conducting valid experiments:
  A/B testing, surveys with effective sampling, …
  – Not: "try a web new design for a week"
  – Not: "please do a site survey" pop-up dialog


• Accepting negative results
  – "no design was better than the other"


• Accepting results you don't agree with
  – “trials imply the proposed strategy won't work”
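A valid experiment also needs more than eyeballing two conversion rates. A minimal sketch of a two-sided two-proportion z-test in pure Python (the visitor and conversion counts in any usage are made up; use your own):

```python
# Two-proportion z-test for an A/B experiment: are the conversion rates
# of variants A and B plausibly the same? Pure stdlib, normal approximation.
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Return (z, p_value) for H0: equal conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

A large p-value is exactly the "negative result" above: no evidence one design beat the other, and that answer has to be acceptable too.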

Example: Yahoo!
• Online Application logic driven by big lookup tables

• Lookup data computed periodically on Hadoop
  – Machine learning, other expensive computation offline
  – Personalization, classification, fraud, value analysis…


• Application development requires data science
  – Huge amounts of actually observed data are key to modern apps
  – Hadoop used as the science platform




Yahoo! Homepage

[Diagram: user behaviour data feeds a science Hadoop cluster, where machine learning builds ever-better categorization models (refreshed weekly). A production Hadoop cluster applies those models to user behaviour to identify user interests, producing serving maps (users → interests) every five minutes. Serving systems use the maps to build customised home pages with the latest data (thousands per second) for engaged users.]

• Serving maps: users → interests
• Five-minute production cycle
• Weekly categorization models

Copyright Yahoo 2011
Conclusions

Hadoop can live alongside existing BI systems – as a data refinery

• Store, refine bulk & unstructured data
• Archive data for long-term analysis
• Support ad-hoc queries over bulk data
• Become the data-science platform
Thank You!
Questions & Answers

hortonworks.com/download





 
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian PloskerThe Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian PloskerJAX London
 

More from JAX London (20)

Everything I know about software in spaghetti bolognese: managing complexity
Everything I know about software in spaghetti bolognese: managing complexityEverything I know about software in spaghetti bolognese: managing complexity
Everything I know about software in spaghetti bolognese: managing complexity
 
Devops with the S for Sharing - Patrick Debois
Devops with the S for Sharing - Patrick DeboisDevops with the S for Sharing - Patrick Debois
Devops with the S for Sharing - Patrick Debois
 
Busy Developer's Guide to Windows 8 HTML/JavaScript Apps
Busy Developer's Guide to Windows 8 HTML/JavaScript AppsBusy Developer's Guide to Windows 8 HTML/JavaScript Apps
Busy Developer's Guide to Windows 8 HTML/JavaScript Apps
 
It's code but not as we know: Infrastructure as Code - Patrick Debois
It's code but not as we know: Infrastructure as Code - Patrick DeboisIt's code but not as we know: Infrastructure as Code - Patrick Debois
It's code but not as we know: Infrastructure as Code - Patrick Debois
 
Locks? We Don't Need No Stinkin' Locks - Michael Barker
Locks? We Don't Need No Stinkin' Locks - Michael BarkerLocks? We Don't Need No Stinkin' Locks - Michael Barker
Locks? We Don't Need No Stinkin' Locks - Michael Barker
 
Worse is better, for better or for worse - Kevlin Henney
Worse is better, for better or for worse - Kevlin HenneyWorse is better, for better or for worse - Kevlin Henney
Worse is better, for better or for worse - Kevlin Henney
 
Java performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha GeeJava performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha Gee
 
Clojure made-simple - John Stevenson
Clojure made-simple - John StevensonClojure made-simple - John Stevenson
Clojure made-simple - John Stevenson
 
HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias Wessendorf
HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias WessendorfHTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias Wessendorf
HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias Wessendorf
 
Play framework 2 : Peter Hilton
Play framework 2 : Peter HiltonPlay framework 2 : Peter Hilton
Play framework 2 : Peter Hilton
 
Complexity theory and software development : Tim Berglund
Complexity theory and software development : Tim BerglundComplexity theory and software development : Tim Berglund
Complexity theory and software development : Tim Berglund
 
Why FLOSS is a Java developer's best friend: Dave Gruber
Why FLOSS is a Java developer's best friend: Dave GruberWhy FLOSS is a Java developer's best friend: Dave Gruber
Why FLOSS is a Java developer's best friend: Dave Gruber
 
Akka in Action: Heiko Seeburger
Akka in Action: Heiko SeeburgerAkka in Action: Heiko Seeburger
Akka in Action: Heiko Seeburger
 
NoSQL Smackdown 2012 : Tim Berglund
NoSQL Smackdown 2012 : Tim BerglundNoSQL Smackdown 2012 : Tim Berglund
NoSQL Smackdown 2012 : Tim Berglund
 
Closures, the next "Big Thing" in Java: Russel Winder
Closures, the next "Big Thing" in Java: Russel WinderClosures, the next "Big Thing" in Java: Russel Winder
Closures, the next "Big Thing" in Java: Russel Winder
 
Java and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk PepperdineJava and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk Pepperdine
 
Mongo DB on the JVM - Brendan McAdams
Mongo DB on the JVM - Brendan McAdamsMongo DB on the JVM - Brendan McAdams
Mongo DB on the JVM - Brendan McAdams
 
New opportunities for connected data - Ian Robinson
New opportunities for connected data - Ian RobinsonNew opportunities for connected data - Ian Robinson
New opportunities for connected data - Ian Robinson
 
HTML5 Websockets and Java - Arun Gupta
HTML5 Websockets and Java - Arun GuptaHTML5 Websockets and Java - Arun Gupta
HTML5 Websockets and Java - Arun Gupta
 
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian PloskerThe Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
 

Hadoop as a Data Refinery for Business Insights

  • 1. Hadoop as a Data Refinery
       Steve Loughran – Hortonworks, @steveloughran
       London, October 2012
       © Hortonworks Inc. 2012
  • 2. About me:
       – HP Labs: deployment, cloud infrastructure, Hadoop-in-Cloud
       – Apache member and committer: Ant, Axis; author of Ant in Action
       – Hadoop: dynamic deployments, diagnostics on failures, cloud infrastructure integration
       – Joined Hortonworks in 2012; UK based, R&D
  • 3. What is Apache Hadoop?
       – A collection of open source projects at the Apache Software Foundation (ASF), with commercial and community development
       – One of the best examples of open source driving innovation and creating a market
       – A foundation for Big Data solutions: stores petabytes of data reliably, runs highly distributed computation on commodity servers & storage, and powers data-driven business
  • 4. Why Hadoop?
       Business pressure:
       1. Opportunity to enable innovative new business models
       2. Potential new insights that drive competitive advantage
       Technical pressure:
       3. Data collected and stored continues to grow exponentially
       4. Data is increasingly everywhere and in many formats
       5. Traditional solutions not designed for new requirements
       Financial pressure:
       6. Cost of data systems, as % of IT spend, continues to grow
       7. Cost advantages of commodity hardware & open source
  • 5. The data refinery in an enterprise
       [Diagram] New data sources (audio, video, images; docs, text, XML; web logs, clicks; social graphs and feeds; sensors, devices, RFID; spatial, GPS; events, other) flow into the Big Data Refinery (Apache Hadoop: HDFS, Pig, ETL). The refinery exchanges data with Business Transactions & Interactions (web, mobile, CRM, ERP, SCM on SQL, NoSQL and NewSQL stores) and with Business Intelligence & Analytics (EDW, MPP, NewSQL; dashboards, reports, visualization).
  • 6. Modernising Business Intelligence
       Before:
       – Current records & short history only
       – Analytics/BI systems keep conformed / cleaned / digested data
       – Unstructured data locked in silos or archived offline
       – Inflexible: new questions require system redesigns
       Now:
       – Keep raw data in Hadoop for a long time
       – Reprocess/enhance analytics/BI data on demand
       – Experiment directly on all the raw data
       – New products and services can be added very quickly
       Storage and agility justify the new infrastructure
  • 7. Refineries pull in raw data
       Internal: pipelines with Apache Flume
       – Web site logs
       – Real-world events: retail, financial, vehicle movements
       – New data sources you create
       – The data you couldn't afford to keep before
       External: pipelines and bulk deliveries (Apache Sqoop)
       – Correlating data: weather, market, competition
       – New sources: Twitter feeds, Infochimps, open government data
       – Real-world events: retail, financial
       External data helps you understand your own data
  • 8. Refineries refine raw data
       – Clean up raw data
       – Filter the "cleaned" data
       – Forward data to different destinations: existing BI infrastructure, new "Agile Data" infrastructures
       – Offload work from the core data warehouse: ETL operations, report and chart generation, ad-hoc queries
       Needs: query, workflow and reporting tools
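The "clean up and filter" step above can be sketched in miniature. This is an illustrative example, not from the deck: the field names, sample records and sanity thresholds are invented, and a real refinery would run this logic at scale in Pig or MapReduce rather than plain Python.

```python
# Hypothetical sketch of a refinery cleanup/filter step.
# Field names (ts, store, amount) and rules are illustrative only.
import csv
import io

RAW = """ts,store,amount
2012-10-01T09:00:00,London,12.50
,London,3.20
2012-10-01T09:05:00,Leeds,-1.00
2012-10-01T09:07:00,Leeds,8.75
"""

def refine(raw_csv):
    """Drop records with missing timestamps or impossible amounts."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        if not row["ts"]:          # incomplete event: discard
            continue
        amount = float(row["amount"])
        if amount < 0:             # sanity check against your world view
            continue
        cleaned.append({"ts": row["ts"], "store": row["store"], "amount": amount})
    return cleaned

print(len(refine(RAW)))  # 2 records survive the filters
```

As the later editor's note suggests, you would normally also log and sample the discarded outliers rather than silently dropping them.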
  • 9. Refineries can store data
       – Retain historical transaction data and analyses
       – Store (cleaned, filtered, compressed) raw data
       – Provide the history for more advanced analysis in future applications and queries
       Needs: storage and query tools
       – Storage: HDFS and HBase
       – Languages: Pig & Hive
       – Workflow for scheduled jobs: Oozie
       – Shared schema repository: HCatalog
       Hadoop makes storing bulk & historical data affordable
  • 10. What if I didn't have a Data Warehouse?
  • 11. Congratulations!
       1. HBase: scale, Hadoop integration
       2. MongoDB, CouchDB, Riak: good for web UIs
       3. Postgres, MySQL, …: transactions
  • 12. Agile Data
  • 13. Agile Data
       – SQL experts: Hive HQL queries
       – Ad-hoc queries: Pig
       – Statistics platform: R + Hadoop
       – Visualisation tools, including Excel
       – New web UI applications
       Because you don't know all that you are looking for when you collect the data
  • 15. Pig: an Agile Data language
       – Optimised for refining data
       – Dataflow-driven: much higher level than Java
       – Macros and User Defined Functions
       – ILLUSTRATE aids development
       – For ad-hoc and production use
  • 16. Example: Packetpig

       snort_alerts = LOAD '$pcap'
         USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

       countries = FOREACH snort_alerts
         GENERATE com.packetloop.packetpig.udf.geoip.Country(src) AS country, priority;

       countries = GROUP countries BY country;

       countries = FOREACH countries
         GENERATE group, AVG(countries.priority) AS average_severity;

       STORE countries INTO 'output/choropleth_countries' USING PigStorage(',');
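To make the dataflow above concrete: the Pig script groups Snort alerts by the source country of each packet and averages the alert priority per country. The same aggregation, restated as a small Python sketch (the sample alerts are invented for illustration; Pig would run this over the full packet capture in parallel):

```python
# Plain-Python restatement of the Pig GROUP BY / AVG above.
# The alert records here are made-up sample data.
from collections import defaultdict

alerts = [
    {"country": "US", "priority": 1},
    {"country": "US", "priority": 3},
    {"country": "CN", "priority": 2},
]

def average_severity(alerts):
    # GROUP countries BY country
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert["country"]].append(alert["priority"])
    # FOREACH group GENERATE AVG(priority)
    return {country: sum(ps) / len(ps) for country, ps in grouped.items()}

print(average_severity(alerts))  # {'US': 2.0, 'CN': 2.0}
```

The per-country averages are what feeds the choropleth map on the next slide.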
  • 17. Web UI: d3.js
  • 18. Analytics Apps: It Takes a Team
       – A broad skill-set is needed to make useful apps
       – Basically nobody has all of those skills
       – Application development is inherently collaborative
  • 19. Developers: learn statistics via Pig
       Data scientists & statisticians: learn Pig (and R)
       Russ Jurney @ HUG UK in November: meetup.com/hadoop-users-group-uk/
  • 20. Challenge: Becoming a data-driven organisation
  • 21. Challenges
       – Thinking of the right questions to ask
       – Conducting valid experiments: A/B testing, surveys with effective sampling, …
         – Not: "try a new web design for a week"
         – Not: a "please do a site survey" pop-up dialog
       – Accepting negative results: "neither design was better than the other"
       – Accepting results you don't agree with: "trials imply the proposed strategy won't work"
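A "valid experiment" in the A/B-testing sense means deciding statistically, not by eyeballing. One common approach (my illustration, not from the deck) is a two-proportion z-test: only if |z| exceeds about 1.96 can you claim a difference at the 5% level; otherwise you must accept the negative result. The conversion counts below are invented.

```python
# Illustrative two-proportion z-test for an A/B experiment.
# Sample sizes and conversion counts are made-up example numbers.
import math

def z_score(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Design A: 200/10000 conversions; design B: 230/10000.
z = z_score(200, 10000, 230, 10000)
print(round(z, 2))  # 1.46 -- below 1.96, so neither design is provably better
```

This is exactly the uncomfortable outcome the slide warns about: with these numbers, the honest conclusion is "no detectable difference", however much the team prefers design B.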
  • 22. Example: Yahoo!
       – Online application logic driven by big lookup tables
       – Lookup data computed periodically on Hadoop: machine learning and other expensive computation run offline; personalization, classification, fraud, value analysis, …
       – Application development requires data science: huge amounts of actually observed data are key to modern apps, and Hadoop is used as the science platform
  • 23. Yahoo! Homepage
       [Diagram] A science Hadoop cluster applies machine learning to user behavior, building ever better categorization models (weekly). A production Hadoop cluster uses those categorization models to identify user interests, producing serving maps of users to interests (every 5 minutes). Serving systems combine the serving maps with the latest user behavior to build customised home pages for engaged users, thousands per second. (Copyright Yahoo 2011)
  • 24. Conclusions
       Hadoop can live alongside existing BI systems as a data refinery:
       – Store and refine bulk & unstructured data
       – Archive data for long-term analysis
       – Support ad-hoc queries over bulk data
       – Become the data-science platform
  • 25. Thank You! Questions & Answers. hortonworks.com/download

Editor's Notes

  1. In the graphic above, Apache Hadoop acts as the Big Data Refinery. It's great at storing, aggregating, and transforming multi-structured data into more useful and valuable formats.
Apache Hive is a Hadoop-related component that fits within the Business Intelligence & Analytics category, since it is commonly used for querying and analyzing data within Hadoop in a SQL-like manner. Apache Hadoop can also be integrated with EDW, MPP, and NewSQL components such as Teradata, Aster Data, HP Vertica, IBM Netezza, EMC Greenplum, SAP Hana, Microsoft SQL Server PDW and many others.
Apache HBase is a Hadoop-related NoSQL key/value store that is commonly used for building highly responsive next-generation applications. Apache Hadoop can also be integrated with other SQL, NoSQL, and NewSQL technologies such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2, MongoDB, DynamoDB, MarkLogic, Riak, Redis, Neo4J, Terracotta, GemFire, SQLFire, VoltDB and many others.
Finally, data movement and integration technologies help ensure data flows seamlessly between the systems in the diagram; the lines in the graphic are powered by technologies such as WebHDFS, Apache HCatalog, Apache Sqoop, Talend Open Studio for Big Data, Informatica, Pentaho, SnapLogic, Splunk, Attunity and many others.
  2. At the highest level, I describe three broad areas of data processing and outline how these areas interconnect. The three areas are: (1) Business Transactions & Interactions, (2) Business Intelligence & Analytics, and (3) the Big Data Refinery. The graphic illustrates a vision for how these three types of systems can interconnect in ways aimed at deriving maximum value from all forms of data.
Enterprise IT has been connecting systems via classic ETL processing, as illustrated in Step 1, for many years in order to deliver structured and repeatable analysis. In this step, the business determines the questions to ask and IT collects and structures the data needed to answer those questions.
The "Big Data Refinery", highlighted in Step 2, is a new system capable of storing, aggregating, and transforming a wide range of multi-structured raw data sources into usable formats that help fuel new insights for the business. It provides a cost-effective platform for unlocking the potential value within data and discovering the business questions worth answering with that data. A popular example of big data refining is processing web logs, clickstreams, social interactions, social feeds, and other user-generated data sources into more accurate assessments of customer churn, or more effective creation of personalized offers.
More interestingly, there are businesses deriving value from processing large video, audio, and image files. Retail stores, for example, are leveraging in-store video feeds to help them better understand how customers navigate the aisles as they find and purchase products. Retailers that provide optimized shopping paths and intelligent product placement within their stores are able to drive more revenue for the business. In this case, while the video files may be big in size, the refined output of the analysis is typically small in size but potentially big in value. The Big Data Refinery platform provides fertile ground for new types of tools and data processing workloads to emerge, in support of rich multi-level data refinement solutions.
With that as backdrop, Step 3 takes the model further by showing how the Big Data Refinery interacts with the systems powering Business Transactions & Interactions and Business Intelligence & Analytics. Interacting in this way opens up the ability for businesses to get a richer and more informed 360° view of customers, for example. By directly integrating the Big Data Refinery with existing Business Intelligence & Analytics solutions that contain much of the transactional information for the business, companies can enhance their ability to more accurately understand the customer behaviors that lead to transactions. Moreover, systems focused on Business Transactions & Interactions can also benefit from connecting with the Big Data Refinery: complex analytics and calculations of key parameters can be performed in the refinery and flow downstream to fuel runtime models powering business applications, with the goal of more accurately targeting customers with the best and most relevant offers.
Since the Big Data Refinery is great at retaining large volumes of data for long periods of time, the model is completed with the feedback loops illustrated in Steps 4 and 5. Retaining the past 10 years of historical "Black Friday" retail data, for example, can benefit the business, especially if it's blended with other data sources such as 10 years of weather data accessed from a third-party data provider. The point here is that the opportunities for creating value from multi-structured data sources available inside and outside the enterprise are virtually endless, if you have a platform that can do it cost-effectively and at scale.
  3. Real-world data is 'dirty': you need to clean it up. Examples: merge multiple events into one covering an extended period; sanity-check events against your world view (how fast things move, how much things cost; there is much danger here); clean up text and discard empty fields. You may still want to retain the original data to see what was filtered; at the very least, log and sample the outliers.
  4. This is taking a metaphor beyond the limits: all that comes next is photos of Grangemouth or Milford Haven. Real-world refineries have giant storage tanks to buffer differences between ingress and egress rates. Here we are proposing keeping the data near the refinery.
  5. RCFile (Record Columnar File): http://en.wikipedia.org/wiki/RCFile. HCatalog is a table abstraction and a storage abstraction system that makes it easy for multiple tools to interact with the same underlying data. A common buzzword in the NoSQL world today is "polyglot persistence": basically, you pick the right tool for the job. In the Hadoop ecosystem you have many tools that might be used for data processing (Pig or Hive, your own custom MapReduce program, or that shiny new GUI-based tool that's just come out), and which one to use might depend on the user, the type of query you're interested in, or the type of job you want to run. From another perspective, you might want to store your data in columnar storage for efficient storage and retrieval for particular query types, or in text so that users can write data producers in scripting languages like Perl or Python, or you may want to hook up an HBase table as a data source. As an end-user, I want to use whatever data processing tool is available to me. As a data designer, I want to optimize how data is stored. As a cluster manager/data architect, I want the ability to share pieces of information across the board, and to move data back and forth fluidly. HCatalog's hopes and promises are the realization of all of the above.
  6. This is an example that went up on our web site recently, using Pig to analyse NetFlow packets and look for origins over time. That's the kind of thing you can only do with large datasets. Using a language like Pig helps you look at the numbers and decide what the next questions to ask are.
  7. This is important: once you start becoming more aware of your customers, your potential customers, your internal state and the world outside, you have more information than ever before. Yet you still need to analyse it.
  8. Conducting valid experiments: A/B testing of two different options must be conducted truly at random, to avoid selection bias or influence by external factors. Accepting negative results: it's OK to have an outcome that says "neither option is any better or worse than the other". Accepting results you don't agree with: evidence that your idea doesn't work. The third is hard, and is why you need large, valid sample sets; otherwise you could dismiss the result as a bad experiment. Governments are classic examples of organisations that don't do this: badger culling and drug policies are key examples where policy is driven by the beliefs of constituencies (farmers, the Daily Mail) rather than by recognising the evidence and trying to explain to those constituencies that they are mistaken. This isn't a critique of the current administration; the previous one was also belief-driven rather than fact-driven.
  9. This is just one scenario of how data can flow into a combined ecosystem of Hadoop, Aster Data, and Teradata. In this scenario, Hadoop is acting as a raw data store and transformation engine to load Aster Data. There are also scenarios where raw data could be loaded directly into the Aster Data and Teradata systems; the point is that each system has a role and focus for the customer. The goal is to understand how harnessing more (or all) of the data provides more value to the customer: more users on the Aster and Teradata systems, more data-driven applications, and more.