SlideShare a Scribd company logo
1 of 49
Making Sense of

BIG DATA          with Hadoop
●   13 years with a pager
●   Oracle ACE Director
●   Oak table member
●   Senior consultant for Pythian
●   @gwenshap
●   http://www.pythian.com/news/
    author/shapira/
●   shapira@pythian.com




                      © 2012 Pythian
Pythian
    Recognized Leader:
    •   Global industry-leader in remote database administration services and consulting for
        Oracle, Oracle Applications, MySQL and Microsoft SQL Server

    •   Work with over 165 multinational companies such as LinkShare Corporation, IGN
        Entertainment, CrowdTwist, TinyCo and Western Union to help manage their
        complex IT deployments

    Expertise:
    •   One of the world’s largest concentrations of dedicated, full-time DBA expertise.
        Employ 7 Oracle ACEs/ACE Directors. Heavily involved in the MySQL community,
        driving the MySQL Professionals Group and sit on the IOUG Advisory Board for MySQL.

    •   Hold 7 Specializations under Oracle Platinum Partner program, including Oracle
        Exadata, Oracle GoldenGate & Oracle RAC

    Global Reach & Scalability:
    •   24/7/365 global remote support for DBA and consulting, systems administration,
        special projects or emergency response

3                               © 2012 Pythian
What is Big Data?
MORE DATA THAN
YOU CAN HANDLE



   © 2012 Pythian
MORE DATA THAN
RELATIONAL
DATABASES
CAN HANDLE


   © 2012 Pythian
MORE DATA THAN
RELATIONAL
DATABASES
CAN HANDLE
CHEAPLY

   © 2012 Pythian
Data Arriving at fast Rates
Typically unstructured
Stored without aggregation
Analyzed in Real Time
For Reasonable Cost



       © 2012 Pythian
Complex Data Architecture




     © 2012 Pythian
Your Data
                   is   NOT
                  as    BIG
                 as you think




© 2012 Pythian
Why Big Data?
Why Hadoop?
BECAUSE WE CAN



    © 2012 Pythian
More Data Beats Smarter
Algorithms




         © 2012 Pythian
email
                                 Photos
         Job posting

Tweets         Video              Medical
                                  imaging
   Sensors                 Blog posts
  Tags         Scanned docs
          © 2012 Pythian
Data is Messy
An Imperial College Team found:
     •3,000 patients under 19 were treated in geriatric clinics


     • between 15,000 and 20,000 men have been admitted to
     obstetric wards


     •and almost 10,000 to gynecology wards




     http://www.straightstatistics.org/blog/2012/04/06/why-are-so-many-men-pregnant



16                               © 2012 Pythian
Unstructured
Eventually Structured Data
Scalable Storage
            +
Massive Parallel Processing
            +
     Reasonable Cost



      © 2012 Pythian
Hadoop: Platform for distributed
computing




         © 2012 Pythian
Hadoop is Scalable. But not fast.




          © 2012 Pythian
Much Ado about Hadoop
Assumptions
• Lots of data
• Large Files
• Unstructured
• Scan entire files
• Unreliable Hardware
• Adding servers = increase capacity




                      © 2012 Pythian
Principles
• Bring Code to Data
• Share Nothing




                  © 2012 Pythian
HDFS
• Distributed
• Replicated
• Big Files
• Write Once
• Read Entire File




                     © 2012 Pythian
/users/shapira/log-1, blocks {1,4,5}
        /users/shapira/log-2, blocks {2,3,6}




1   4                     5        2           3

                                               1   4
5



2   4                 1        3               2   3

6                     6                        5   6




              © 2012 Pythian
Map Reduce
                Combine
         Map                Reduce
 Start
         Map                           Stop
 Job 1                      Reduce?
                                       Job 1
         …                    …

         Map                Reduce?

               Hadoop Job
                                      Results
                Combine
         Map                Reduce

 Start   Map                Reduce?
 Job 2                                 Stop
         …                             Job 1
                              …

         Map
                            Reduce?
Implementation
• Balance disks, cores and RAM
• High Bandwidth
• More nodes or better nodes?




                   © 2012 Pythian
It’s about the Ecosystem
• Sqoop
• Flume
• Hive
• Pig
• HBase




          © 2012 Pythian
Use Cases
Use Case:
Log processing
Use Case:
            ETL
                        BI




OLTP                    DWH



       © 2012 Pythian
Use Case:
Recommendations
Use case:
Listening to the crowd




         © 2012 Pythian
Our customers use Hadoop for:
     • Storing lots of pre-processed data
     • Merging different data types
     • Scalable data processing
     • Advanced data processing




34                     © 2012 Pythian
Big Data in your Company
Easy case:
Your CTO heard about Big Data
And is eager to invest.
You have a Big Budget.



       © 2012 Pythian
Require


Measure                        Acquire




 Serve                         Organize


                     Analyze



    © 2012 Pythian
Require

                                Hadoop
 Measure
                                NoSQL
                                 OLTP

  BI,
NoSQL,                          RDMB
Oracle

                      Hadoop
                       BI, R


     © 2012 Pythian
Data Scientist
=
Sneaky BI
Disregards Silos
Cool Toys




       © 2012 Pythian
Mining Tools:
• Machine Learning
• Cluster Detection
• Regression
• Graph Analysis
• Visualization



        © 2012 Pythian
http://nicolasrapp.com/?p=1118

                         © 2012 Pythian
http://www.orgnet.com/slumlords.html

                     © 2012 Pythian
Want to do more with your data?
Don’t know where to start?
No budget?

No problem!


       © 2012 Pythian
Sneak Hadoop to Your Business
• Find an important business problem
• Acquire data (be sneaky!)
• Get the tools: R, Hadoop, Tableau
• Laptops, desktops, test servers
• Analyze data
• Make pretty charts
• Get business used to it
• Wait for an Outage
• PROFIT!




                  © 2012 Pythian
Oracle Big Data
The “ETL Machine”
Hardware
18 servers
216 cores
864G RAM
648T disks
Infiniband




             © 2012 Pythian
Software
Oracle NoSQL
Cloudera Hadoop Distribution
Oracle Loader for Hadoop
Data Integrator for Hadoop
Direct Connector for Hadoop
Oracle Connector for R



           © 2012 Pythian
Cores, Storage, Infiniband and Software
Makes Oracle Big Data
The Ultimate ETL Machine



            © 2012 Pythian
Thank you & Q&A
     To contact us…

           sales@pythian.com

           1-866-PYTHIAN



     To follow us…
           http://www.pythian.com/news/

           http://www.facebook.com/pages/The-Pythian-Group/

           http://twitter.com/pythian

           http://www.linkedin.com/company/pythian



49                           © 2012 Pythian

More Related Content

What's hot

Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudLeons Petražickis
 
Introduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data ApplicationsIntroduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data ApplicationsCloudera, Inc.
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...DataWorks Summit
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
 
A Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceA Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceSlim Baltagi
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXBMC Software
 
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data avanttic Consultoría Tecnológica
 
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Alluxio, Inc.
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big DataDataWorks Summit
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutionssolarisyougood
 
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)Ontico
 
Wrangling Customer Usage Data with Hadoop
Wrangling Customer Usage Data with HadoopWrangling Customer Usage Data with Hadoop
Wrangling Customer Usage Data with HadoopDataWorks Summit
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAlluxio, Inc.
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14iwrigley
 
Building a Digital Bank
Building a Digital BankBuilding a Digital Bank
Building a Digital BankDataStax
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningDataWorks Summit
 

What's hot (20)

Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
 
Introduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data ApplicationsIntroduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data Applications
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
 
A Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceA Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to Finance
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
 
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
 
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big Data
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
Big Data: Myths and Realities
Big Data: Myths and RealitiesBig Data: Myths and Realities
Big Data: Myths and Realities
 
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)
How to run Real Time processing on Big Data / Ron Zavner (GigaSpaces)
 
Wrangling Customer Usage Data with Hadoop
Wrangling Customer Usage Data with HadoopWrangling Customer Usage Data with Hadoop
Wrangling Customer Usage Data with Hadoop
 
Big Data Introduction
Big Data IntroductionBig Data Introduction
Big Data Introduction
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
 
Building a Digital Bank
Building a Digital BankBuilding a Digital Bank
Building a Digital Bank
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine Learning
 

Viewers also liked

An Intro to Atom Editor
An Intro to Atom EditorAn Intro to Atom Editor
An Intro to Atom EditorAteev Chopra
 
Box2D with SIMD in JavaScript
Box2D with SIMD in JavaScriptBox2D with SIMD in JavaScript
Box2D with SIMD in JavaScriptIntel® Software
 
Intro to-freecad
Intro to-freecadIntro to-freecad
Intro to-freecadSLQedge
 
Google guava overview
Google guava overviewGoogle guava overview
Google guava overviewSteve Min
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataHortonworks
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM Cynthia Saracco
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 

Viewers also liked (20)

An Intro to Atom Editor
An Intro to Atom EditorAn Intro to Atom Editor
An Intro to Atom Editor
 
Box2D with SIMD in JavaScript
Box2D with SIMD in JavaScriptBox2D with SIMD in JavaScript
Box2D with SIMD in JavaScript
 
Intro to-freecad
Intro to-freecadIntro to-freecad
Intro to-freecad
 
Google guava
Google guavaGoogle guava
Google guava
 
Google guava overview
Google guava overviewGoogle guava overview
Google guava overview
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Google Guava
Google GuavaGoogle Guava
Google Guava
 
IT Quiz
IT QuizIT Quiz
IT Quiz
 
Cisco project ideas
Cisco   project ideasCisco   project ideas
Cisco project ideas
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Atom IDE
Atom IDEAtom IDE
Atom IDE
 

Similar to Making Sense of Big data with Hadoop

Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseGwen (Chen) Shapira
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoopinside-BigData.com
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
UK - Agile Data Applications on Hadoop
UK - Agile Data Applications on HadoopUK - Agile Data Applications on Hadoop
UK - Agile Data Applications on HadoopHortonworks
 
Understanding Big Data for policy professionals
Understanding Big Data for policy professionalsUnderstanding Big Data for policy professionals
Understanding Big Data for policy professionalsAlex Jouravlev
 
Integrating Big Data Technologies
Integrating Big Data TechnologiesIntegrating Big Data Technologies
Integrating Big Data TechnologiesDATAVERSITY
 
Offload, Transform, and Present - the New World of Data Integration
Offload, Transform, and Present - the New World of Data IntegrationOffload, Transform, and Present - the New World of Data Integration
Offload, Transform, and Present - the New World of Data IntegrationMichael Rainey
 
Webinar - Big Data: Einführung in Hadoop und MapReduce
Webinar - Big Data: Einführung in Hadoop und MapReduceWebinar - Big Data: Einführung in Hadoop und MapReduce
Webinar - Big Data: Einführung in Hadoop und MapReduceinovex GmbH
 
Operationalizing Data Analytics
Operationalizing Data AnalyticsOperationalizing Data Analytics
Operationalizing Data AnalyticsVMware Tanzu
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015Cloudera, Inc.
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinerySteve Loughran
 
Hadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranHadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranJAX London
 
Conflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big DataConflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big DataHalo BI
 
Hadoop for Business Intelligence Professionals
Hadoop for Business Intelligence ProfessionalsHadoop for Business Intelligence Professionals
Hadoop for Business Intelligence ProfessionalsSkillspeed
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointInside Analysis
 
Game Changed – How Hadoop is Reinventing Enterprise Thinking
Game Changed – How Hadoop is Reinventing Enterprise ThinkingGame Changed – How Hadoop is Reinventing Enterprise Thinking
Game Changed – How Hadoop is Reinventing Enterprise ThinkingInside Analysis
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureSkillspeed
 
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Data Con LA
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopHortonworks
 

Similar to Making Sense of Big data with Hadoop (20)

Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
UK - Agile Data Applications on Hadoop
UK - Agile Data Applications on HadoopUK - Agile Data Applications on Hadoop
UK - Agile Data Applications on Hadoop
 
Understanding Big Data for policy professionals
Understanding Big Data for policy professionalsUnderstanding Big Data for policy professionals
Understanding Big Data for policy professionals
 
Integrating Big Data Technologies
Integrating Big Data TechnologiesIntegrating Big Data Technologies
Integrating Big Data Technologies
 
Offload, Transform, and Present - the New World of Data Integration
Offload, Transform, and Present - the New World of Data IntegrationOffload, Transform, and Present - the New World of Data Integration
Offload, Transform, and Present - the New World of Data Integration
 
Webinar - Big Data: Einführung in Hadoop und MapReduce
Webinar - Big Data: Einführung in Hadoop und MapReduceWebinar - Big Data: Einführung in Hadoop und MapReduce
Webinar - Big Data: Einführung in Hadoop und MapReduce
 
Operationalizing Data Analytics
Operationalizing Data AnalyticsOperationalizing Data Analytics
Operationalizing Data Analytics
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
 
Hadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranHadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve Loughran
 
Conflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big DataConflict in the Cloud – Issues & Solutions for Big Data
Conflict in the Cloud – Issues & Solutions for Big Data
 
Hadoop for Business Intelligence Professionals
Hadoop for Business Intelligence ProfessionalsHadoop for Business Intelligence Professionals
Hadoop for Business Intelligence Professionals
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Game Changed – How Hadoop is Reinventing Enterprise Thinking
Game Changed – How Hadoop is Reinventing Enterprise ThinkingGame Changed – How Hadoop is Reinventing Enterprise Thinking
Game Changed – How Hadoop is Reinventing Enterprise Thinking
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
 
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
 
201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on Hadoop
 

More from Gwen (Chen) Shapira

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep DiveGwen (Chen) Shapira
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote Gwen (Chen) Shapira
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGwen (Chen) Shapira
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Gwen (Chen) Shapira
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebookGwen (Chen) Shapira
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Gwen (Chen) Shapira
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupGwen (Chen) Shapira
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings MeetupGwen (Chen) Shapira
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereGwen (Chen) Shapira
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersGwen (Chen) Shapira
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupGwen (Chen) Shapira
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupGwen (Chen) Shapira
 

More from Gwen (Chen) Shapira (20)

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep Dive
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service mesh
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebook
 
Kafka reliability velocity 17
Kafka reliability   velocity 17Kafka reliability   velocity 17
Kafka reliability velocity 17
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data Meetup
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn Meetup
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 

Recently uploaded

Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 

Recently uploaded (20)

Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 

Making Sense of Big data with Hadoop

  • 1. Making Sense of BIG DATA with Hadoop
  • 2. 13 years with a pager ● Oracle ACE Director ● Oak table member ● Senior consultant for Pythian ● @gwenshap ● http://www.pythian.com/news/ author/shapira/ ● shapira@pythian.com © 2012 Pythian
  • 3. Pythian Recognized Leader: • Global industry-leader in remote database administration services and consulting for Oracle, Oracle Applications, MySQL and Microsoft SQL Server • Work with over 165 multinational companies such as LinkShare Corporation, IGN Entertainment, CrowdTwist, TinyCo and Western Union to help manage their complex IT deployments Expertise: • One of the world’s largest concentrations of dedicated, full-time DBA expertise. Employ 7 Oracle ACEs/ACE Directors. Heavily involved in the MySQL community, driving the MySQL Professionals Group and sit on the IOUG Advisory Board for MySQL. • Hold 7 Specializations under Oracle Platinum Partner program, including Oracle Exadata, Oracle GoldenGate & Oracle RAC Global Reach & Scalability: • 24/7/365 global remote support for DBA and consulting, systems administration, special projects or emergency response 3 © 2012 Pythian
  • 4. What is Big Data?
  • 5. MORE DATA THAN YOU CAN HANDLE © 2012 Pythian
  • 7. MORE DATA THAN RELATIONAL DATABASES CAN HANDLE CHEAPLY © 2012 Pythian
  • 8. Data Arriving at fast Rates Typically unstructured Stored without aggregation Analyzed in Real Time For Reasonable Cost © 2012 Pythian
  • 9. Complex Data Architecture © 2012 Pythian
  • 10. Your Data is NOT as BIG as you think © 2012 Pythian
  • 11. Why Big Data? Why Hadoop?
  • 12. BECAUSE WE CAN © 2012 Pythian
  • 13. More Data Beats Smarter Algorithms © 2012 Pythian
  • 14. email Photos Job posting Tweets Video Medical imaging Sensors Blog posts Tags Scanned docs © 2012 Pythian
  • 16. An Imperial College Team found: •3,000 patients under 19 were treated in geriatric clinics • between 15,000 and 20,000 men have been admitted to obstetric wards •and almost 10,000 to gynecology wards http://www.straightstatistics.org/blog/2012/04/06/why-are-so-many-men-pregnant 16 © 2012 Pythian
  • 18. Scalable Storage + Massive Parallel Processing + Reasonable Cost © 2012 Pythian
  • 19. Hadoop: Platform for distributed computing © 2012 Pythian
  • 20. Hadoop is Scalable. But not fast. © 2012 Pythian
  • 21. Much Ado about Hadoop
  • 22. Assumptions • Lots of data • Large Files • Unstructured • Scan entire files • Unreliable Hardware • Adding servers = increase capacity © 2012 Pythian
  • 23. Principles • Bring Code to Data • Share Nothing © 2012 Pythian
  • 24. HDFS • Distributed • Replicated • Big Files • Write Once • Read Entire File © 2012 Pythian
  • 25. /users/shapira/log-1, blocks {1,4,5} /users/shapira/log-2, blocks {2,3,6} 1 4 5 2 3 1 4 5 2 4 1 3 2 3 6 6 5 6 © 2012 Pythian
  • 26. Map Reduce Combine Map Reduce Start Map Stop Job 1 Reduce? Job 1 … … Map Reduce? Hadoop Job Results Combine Map Reduce Start Map Reduce? Job 2 Stop … Job 1 … Map Reduce?
  • 27. Implementation • Balance disks, cores and RAM • High Bandwidth • More nodes or better nodes? © 2012 Pythian
  • 28. It’s about the Ecosystem • Sqoop • Flume • Hive • Pig • HBase © 2012 Pythian
  • 31. Use Case: ETL BI OLTP DWH © 2012 Pythian
  • 33. Use case: Listening to the crowd © 2012 Pythian
  • 34. Our customers use Hadoop for: • Storing lots of pre-processed data • Merging different data types • Scalable data processing • Advanced data processing 34 © 2012 Pythian
  • 35. Big Data in your Company
  • 36. Easy case: Your CTO heard about Big Data And is eager to invest. You have a Big Budget. © 2012 Pythian
  • 37. Require Measure Acquire Serve Organize Analyze © 2012 Pythian
  • 38. Require Hadoop Measure NoSQL OLTP BI, NoSQL, RDMB Oracle Hadoop BI, R © 2012 Pythian
  • 39. Data Scientist = Sneaky BI Disregards Silos Cool Toys © 2012 Pythian
  • 40. Mining Tools: • Machine Learning • Cluster Detection • Regression • Graph Analysis • Visualization © 2012 Pythian
  • 43. Want to do more with your data? Don’t know where to start? No budget? No problem! © 2012 Pythian
  • 44. Sneak Hadoop to Your Business • Find an important business problem • Acquire data (be sneaky!) • Get the tools: R, Hadoop, Tableau • Laptops, desktops, test servers • Analyze data • Make pretty charts • Get business used to it • Wait for an Outage • PROFIT! © 2012 Pythian
  • 45. Oracle Big Data The “ETL Machine”
  • 46. Hardware 18 servers 216 cores 864G RAM 648T disks Infiniband © 2012 Pythian
  • 47. Software Oracle NoSQL Cloudera Hadoop Distribution Oracle Loader for Hadoop Data Integrator for Hadoop Direct Connector for Hadoop Oracle Connector for R © 2012 Pythian
  • 48. Cores, Storage, Infiniband and Software Makes Oracle Big Data The Ultimate ETL Machine © 2012 Pythian
  • 49. Thank you & Q&A To contact us… sales@pythian.com 1-866-PYTHIAN To follow us… http://www.pythian.com/news/ http://www.facebook.com/pages/The-Pythian-Group/ http://twitter.com/pythian http://www.linkedin.com/company/pythian 49 © 2012 Pythian

Editor's Notes

  1. We are a managed service AND a solution provider of elite database and System Administration skills in Oracle, MySQL and SQL Server
  2. We want the data, the whole data and nothing but the data.
  3. You can no longer just throw one database at the problem and expect it to solve all your problems. Different parts of the solution require different technologies.I’ll talk mostly about Hadoop
  4. Bad schema design is not big dataUsing 8 year old hardware is not big dataNot having purging policy is not big dataNot configuring your database and operating system correctly is not big dataPoor data filtering is not big data eitherKeep the data you need and use. In a way that you can actually use it.If doing this requires cutting edge technology, excellent! But don’t tell me you need NoSQL because you don’t purge data and have un-optimized PL/SQL running on 10-yo hardware.
  5. We always wanted more data. We never wanted to have to aggregate and then delete old data. We knew we were missing details, subtleties, opportunities. But we had to – because we wanted better performance and couldn’t afford unlimited disks.With new technologies, more data is more feasible.
  6. One of the main reasons for the explosion of data stored in the last few years is that many problems are easier to solve if you apply more data to them.Take the Netflix Challenge for example. Netflix challenged the AI community to improve the movie recommendations made by Netflix to its customers based on a database of ratings and viewing history. Teams that used the available data more extensively did better than teams that used more advanced algorithms on a smaller data set.More data also allows businesses to make better, more informed decisions. Why have focus groups to decide on new store design, if you can re-design several stores and compare how customers proceeded through each store and how many left without buying? On-line stores make the process even easier.Modern businesses become more scientific and metrics driven, and rely less on “gut feeling” as the cost of making business experiments and measuring the results decrease.
  7. Data also arrives in more forms and from more sources than ever. Some of these don’t fit into a relational database very well, and for some, the relational database does not have the right tools to process the data.One of Pythian’s customers analyses social media sources and allow companies to find comments of their performance and service and respond to complaints via non-traditional customer support routes.Storing facebook comments and blog posts in Oracle for later processing, results in most of the data getting stored in BLOBs, where it is relatively difficult to manage. Most of the processing is done outside of Oracle using Nature Language Processing tools. So, why use Oracle for storage at all? Why not store and process the documents elsewhere and only store the ready-to-display results in Oracle?
  8. Data, especially from outside sources is not in a perfect condition to be useful to your business.Not only does it need to be processed into useful formats, it also needs:Filtering for potentially useful information. 99% of everything is crapStatistical analysis – is this data significant?Integration with existing dataEntity resolution. Is “Oracle Corp” the same as “Oracle” and “Oracle Corporation”? De-DuplicationGood processing and filtering of data can reduce the volume and variety of data. It is important to distinguish between true and accidental variety.This requires massive use of processing power. In a way, there is a trade-off between storage space and CPU. If you don’t invest CPU in filtering, de-duping and entity resolution – you’ll need more storage.
  9. Data warehouses require the data to be structured in a certain way, and it has to be structured that way before the data gets into the data warehouse. This means that we need to know all the questions we would like to answer with this data when designing the schema for the data warehouse.This works very well in many cases, but sometimes there are issues:The raw data is not relational – images, video, text and we want to keep raw data for future useThe requirements from the business frequently changeIn these cases it is better to store the data and create patterns from it as it is parsed and processed. This allows the business to move from large up-front design to just-in-time processing.For example: Astrometry project searches Flickr for photos of night sky, identifies the part of the sky its from and the prominent celestial bodies and creates a standard database of the position of elements in the sky.
  10. The new volume of data, and the need to transform it, filter it and clean it up require:Not only more storage, but also faster access ratesReliable storage. We want high availability and resilient systemsYou also need access to as many cores as you can get, to process all this dataThese cores should be as close to the data as possible to avoid moving large amounts of data on the netThe architecture should allow to use many of the cores in parallel for data processing
  11. Hadoop is the most common solution for the new Big Data requirement. It’s a scalable distributed file system, and a distributed job processing system on top of the file system.It is a PLATFORM, not a solution – so Hadoop is unlikely to make your life easier. A lot of querying and processing tasks are more difficult with Hadoop than without. But it makes previously expensive things cheaper, and previously impossible things, possible. This lets companies keep massive amounts of unstructured data and efficiently process it. The assumption behind Hadoop is that most jobs will want to scan entire data sets, not specific rows or columns. So efficient access to specific data is not a core capability.Hadoop is open source, and there is a large eco-system of tools, products and appliances built around it.Open source tools that make data processing on Hadoop easier and more accessible, BI and integration products, improved implementations of Hadoop that are faster or more reliable, Hadoop cloud services and hardware appliances.
  12. There is growing demand for:Real time analyticsServing data processed by Hadoop to customers with very low latency
  13. The exact balance depends on your workload – more CPU heavy?Just lots of data? Lots of disk bandwidth?More nodes: cheaper scalability, more resilient. But – higher cost of administration.
  14. .
  15. Modern data centers generate huge amounts of logs from applications and web services.These logs contain very specific information about how users are using our application and how the application performs.Hadoop is often used to answer questions like:How many users use each feature in my site?Which page do users usually go to after visiting page X?Do people return more often to my site after I made the new changes?What use patterns correlate with people who eventually buy a product?What is the correlation between slow performance and purchase rates?Note that the web logs can be processed, loaded into RDBMS and parsed there. However, we are talking about very large amounts of data, and each piece of data needs to be read just once to answer each question. There are very few relations there. Why bother loading all this to RDBMS?
  16. Hadoop has large storage, high bandwidth, lots of cores and was build for data aggregation.Also, it is cheap.Data is dumped from the OLTP database (Oracle or MySQL) to Hadoop. Transformation code is written on Hadoop to aggregate the data (this is the tricky part) and the data is loaded to the data warehouse (usually Oracle).This is such a common use case that Oracle built an appliance especially for this.
  17. A lot of the modern web experience revolves around websites being about to predict what you’ll do next or what you’d like to do but don’t know about yet.People you may knowJobs you may be interested inOther customers who looked at this product eventually bought…These emails are more important than othersTo generate this information, usage patterns are extracted from OLTP databases and logs, the data is analyzed, and the results are loaded to an OLTP database again for use by the customer.The analysis task started out as daily batch job, but soon users expected more immediate feedback. More processing resources were brought in to speed up the process. Then the system started incorporating customer feedback into the analysis when making new recommendations. This new information needed more storage and more processing power.
  18. One customer uses tweeter as input for customer support. They search tweeter and save all feeds that mention their customers (say, AT&T) to their Hadoop cluster, mine them for relevant information (user, location, what is the problem, how popular he is). They then open tickets in traditional customer support system based on this information. They can also go back and mine all the saved data for re-occurring complaints, problem areas, etc.Another use case is analysts and marketing departments who mine blogs and job postings to find trending topics.
  19. Start with well-defined and obviously important requirement
  20. Oracle’s Big Data machine was built to move data between Oracle RDBMS and Hadoop fast, and I doubt if anyone can beat Oracle at that.Both the tools that are bundled with the machine and the fast IB connection to Exadata make it very attractive for businesses wishing to use Hadoop as ETL solution. Note that the tools should also be avba