Big data
The technology landscape and its applications.




                                                 Natalino Busa - 12 Feb. 2013
Outline


          ● Big Data: Who art thou?
          ● Big Data: The technology landscape

          ● Hadoop: Overview
          ● Analytics & Machine Learning
          ● Opportunities




Hype cycle on new IT technologies

                                    Gartner 2012




What is big data?

        DATA (structured and un-structured, Logs, ETL, social)


            Velocity               Diversity                Volume




                        BIG DATA


           Hardware                Software                Services

      Infrastructure            Marketing (e.g. Unica)    RDBMS
      (Private) Cloud           Analytics (Tableau)       OLAP
      Networking                Modeling (SAS)            Messaging



Big Data Heat map




How big is big?

SkyTree (tm) defines: Analytics Requirements Index (ARI)

                                 ARI = (# Rows × # Columns) / Time (secs)


Where          # Rows =                   Number of records being analyzed

               # Columns =                Number of variables captured in each record

               Time (secs) =              The timeframe within which to complete the analysis




 Example: produce a personalized banner for each page view (1000 views/sec).
 This requires analyzing 100 variables on 1000 records (historic data) every 1 ms.

 ARI = (1000 × 100) / 0.001 = 100 M values/sec
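
The ARI arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name `analytics_requirements_index` is made up for illustration and is not part of any SkyTree tooling.

```python
def analytics_requirements_index(rows: int, columns: int, seconds: float) -> float:
    """Return the ARI in values per second: (rows * columns) / seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows * columns / seconds


# Banner example from the slide: 1000 records x 100 variables every 1 ms.
ari = analytics_requirements_index(rows=1000, columns=100, seconds=0.001)
print(f"{ari / 1e6:.0f} M values/sec")  # prints "100 M values/sec"
```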




What data?

Big Data can imply:


           ●   Complex Data refactoring in Batch                  (lots of rows)
           ●   Real-Time Event Processing                         (high-speed responses)
           ●   Multidimensional analysis                          (lots of parameters)

           ●   ... or any combination of those three
[Figure: three axes labeled Response time, Parameters, Entities]

More data

Database          customers
Databases         customers + products
Federated Data    customers + products + surveys
Aggregated Data   customers + products + surveys + transactions
Linked Data       customers + products + surveys + transactions + social messages
Just Data

Structured  ---------------------------------------------->  Unstructured



   ●    In today's IT environments there is a gradual shift
        from structured data to unstructured data.

 RDBMSs are well suited to deal with structured data ->
   but: ETL keeps growing in volume and complexity; how to deal with new data (structures)?

 Map-Reduce and NoSQL systems are good with unstructured data ->
   but: how do we query and analyze this data?



Big Data: how to deal with it



        ●   Big Data at rest     (storage, access)
        ●   Big Data in motion   (streaming, dataflows)


        ●   Big Data analytics   (OLAP, OTAP, BI)
        ●   Big Data modeling    (predictive, machine learning)




Big Data at rest

Analytical RDBMSs                (EDW) Oracle, IBM, and various MPPs

Hadoop Distributed Systems       HDFS (distributed file system)
                                 HBase (modeled on Bigtable)




[Diagram components: Logs, HDFS, Cassandra, HBase, EDW, Analytics; labels: Batch, Real-time]




  ●   Traditional EDW and distributed Big Data / NoSQL solutions are
      complementary to each other.

  ●   These systems do not exclude each other and can coexist to form
      a full enterprise-level solution.


Big Data at rest

No need to take everything from the Hadoop ecosystem:

NoSQL DBMSs:            Couchbase ( ++ reads, caching)
                         Cassandra ( ++ writes, OLAP)

... hybrid solutions are also possible:

HDFS + Cassandra : in-memory analytics + large DFS
HDFS + Solr/Lucene: fast text search on a distributed file system




Big Data in motion

Stream processing // Dataflow architectures

Used to support the automatic analysis of data-in-motion in real-time or near real-time.

- Identify meaningful patterns
- Trigger action to respond to them as quickly as possible.



- Storm (from Twitter)
  dataflow processing framework
  ++ multi-language

- Akka (from Typesafe)
  dataflow actor framework
  ++ speed

Both are:
distributed, fault-tolerant, streaming


Big Data Landscape

[Diagram: components of the landscape, grouped by role]

Machine Learning on Big Data:   SAS, R over HDFS, Mahout
Data interfaces:                REST, Logs, EDW, unstructured data
Ingestion:                      flume, scribe, sqoop, hiho
Storage:                        HDFS, HBase, Cassandra, FS
Processing and query:           Hive, Pig, MapR, Impala, STORM, OLAP, OTAP

BI:
  ●   Batch Analytics
  ●   Visualization
  ●   Monitoring
  ●   Marketing

  ●   Real-Time Analytics
  ●   Streaming

Lambda Architecture




                                    Logic layer
                                                   Software as a Service
                                                   e.g. real-time predictor




from http://www.manning.com/marz/
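
A toy sketch of the Lambda Architecture idea: a batch view recomputed from the full dataset, a real-time view updated per event, and a query that merges both. The class `LambdaCounter` and its method names are hypothetical, and the sketch assumes each batch run covers every event seen so far.

```python
from collections import Counter
from typing import Iterable


class LambdaCounter:
    """Counts events per key by merging a batch view with a real-time view."""

    def __init__(self) -> None:
        self.batch_view: Counter = Counter()
        self.realtime_view: Counter = Counter()

    def run_batch(self, master_dataset: Iterable[str]) -> None:
        """Batch layer: recompute the view from the complete dataset."""
        self.batch_view = Counter(master_dataset)
        # Assumption: the batch run now covers all events seen so far,
        # so the incremental view can be discarded.
        self.realtime_view.clear()

    def on_event(self, key: str) -> None:
        """Speed layer: update the real-time view incrementally."""
        self.realtime_view[key] += 1

    def query(self, key: str) -> int:
        """Serving layer: merge both views at query time."""
        return self.batch_view[key] + self.realtime_view[key]


views = LambdaCounter()
views.run_batch(["home", "home", "cart"])
views.on_event("home")
print(views.query("home"))  # 3
```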
Why do machine learning on big data




    http://www.skytree.net/why-do-machine-learning-on-big-data/



Machine Learning: What?
SIMILARITY SEARCH
Similarity search provides a way to find the objects that are the most
similar, in an overall sense, to the object(s) of interest.

PREDICTIVE ANALYTICS
Predictive analytics is the science of analyzing current and historical
facts/data to make predictions about future events.

CLUSTERING AND SEGMENTATION
Cluster analysis and segmentation represent a purely data-driven approach
to grouping similar objects, behaviors, or whatever is represented by the data.


From http://www.skytree.net/why-do-machine-learning-on-big-data/use-cases/
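
As a small illustration of similarity search, the sketch below ranks items by cosine similarity to a query vector using a brute-force scan. Cosine similarity is just one possible measure, chosen here as an assumption; the names `cosine`, `most_similar` and the sample customer vectors are invented for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float], items: Dict[str, Sequence[float]], k: int = 1
) -> List[Tuple[str, float]]:
    """Return the k items most similar to the query, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical per-customer feature vectors.
customers = {"ann": [5, 0, 1], "bob": [4, 0, 2], "cy": [0, 3, 0]}
top = most_similar([5, 0, 0], customers, k=2)
print([name for name, _ in top])  # ['ann', 'bob']
```

A brute-force scan like this is linear in the number of items; at big data scale it would normally be replaced by an index or a distributed job.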
Word Counting on Map Reduce
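
The map, shuffle and reduce phases of word counting can be sketched in a single Python process. This only illustrates the programming model; it does not use the Hadoop API, and the function names are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


print(word_count(["big data", "Big Data at rest"]))
# {'big': 2, 'data': 2, 'at': 1, 'rest': 1}
```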




Machine learning on Map Reduce




     From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011




Machine learning on Map Reduce




From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011
Machine Learning: Use Cases

 E-Commerce / E-Tailing
 ● Product Recommendation Engines
 ● Cross Channel Analytics
 ● Events/Activity Behavior Segmentation

 Product Marketing
 ● Campaign management and optimization
 ● Market and consumer segmentation
 ● Pricing Optimization

 Customer Marketing
 ● Customer Churn Management
 ● (Mobile) User Behavior Prediction
 ● Offer Personalization


Big Data: Opportunities

 Unstructured Data
 ● Clustering
 ● Distributed processing
 ● Distributed Storage

 Modeling & Analytics
 ● Distributed Machine Learning
 ● Fast Online Analytics Cubes

 Streaming and Real-Time processing
 ● Build RT profiles
 ● Decision trees and Predictions
 ● Offer Personalization



Thanks


         linkedin:
         www.linkedin.com/in/natalinobusa

         blog:
         www.natalinobusa.com
