Learning Linear Models with Hadoop
Ulrich Rückert




                             © 2012 Datameer, Inc. All rights reserved.


Thursday, March 28, 2013
Agenda

  What are linear models anyway?
  How to learn linear models with Hadoop
  Demo
  Tips, tricks and caveats
  Conclusion




Predictive Analytics

 Example Learning Task
 • Ad on bookseller's web page
 • Will a customer buy this book?
 • Training set: observations on previous customers
 • Test set: new customers

 Let's learn a linear model!

 Training Data (attributes: Age, Income; target attribute: BuysBook)

   Age   Income   BuysBook
   24     60000   yes
   65     80000   no
   60     95000   no
   35     52000   yes
   20     45000   yes
   43     75000   yes
   26     51000   yes
   52     47000   no
   47     38000   no
   25     22000   no
   33     47000   yes

 Test Data

   Age   Income   BuysBook
   22     67000   ?
   39     41000   ?

 Model prediction on the test data

   Age   Income   BuysBook
   22     67000   yes
   39     41000   no



Linear Models

 What's in the black box?
 • Let's pretend all attributes are expert ratings
 • Large positive value means yes
 • Small value means no
 • Intermediate value: don't know

 Let the experts vote
 • Sum over the ratings for each row
 • Larger than threshold: predict yes
 • Smaller: predict no

 Input rows

   Expert 1   Expert 2   BuysBook
   24         60         ?
   64         80         ?
   60         96         ?

 With threshold 97:

   24 + 60 =  84   -> no
   64 + 80 = 144   -> yes
   60 + 96 = 156   -> yes

   Expert 1   Expert 2   Prediction
   24         60         no
   64         80         yes
   60         96         yes




Linear Models

 Assign a weight to each expert
 • Expert is mostly correct: large weight
 • Expert is uninformative: zero
 • Expert is consistently wrong: negative weight

 Learning models
 • A linear model contains weights and a threshold
 • Learn by finding the weights with the lowest error on the training data

 Weights 0.75 and 0.25, threshold 48:

   0.75 • 24 + 0.25 • 60 =  33   -> no
   0.75 • 64 + 0.25 • 80 =  68   -> yes
   0.75 • 60 + 0.25 • 96 =  69   -> yes

 Weights 0 and 0.25, threshold 18 (Expert 1 is ignored):

   0 • 24 + 0.25 • 60 =  15   -> no
   0 • 64 + 0.25 • 80 =  20   -> yes
   0 • 60 + 0.25 • 96 =  24   -> yes

 Weights -0.5 and 0.25, threshold -8 (Expert 1 counts against a purchase):

   -0.5 • 24 + 0.25 • 60 =   3   -> yes
   -0.5 • 64 + 0.25 • 80 = -12   -> no
   -0.5 • 60 + 0.25 • 96 =  -6   -> yes




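 A linear model of this kind is just a weighted sum of the attributes compared against
 a threshold. Below is a minimal sketch in Python (not part of the original deck); the
 weights, threshold, and rows are the illustrative values from the slides above.

   # Minimal sketch of the linear model above: weighted vote vs. threshold.
   def predict(weights, threshold, row):
       """Return 'yes' if the weighted sum of the attributes exceeds the threshold."""
       score = sum(w * x for w, x in zip(weights, row))
       return "yes" if score > threshold else "no"

   weights = [0.75, 0.25]    # one weight per "expert" (attribute)
   threshold = 48

   for row in [(24, 60), (64, 80), (60, 96)]:
       print(row, "->", predict(weights, threshold, row))
   # (24, 60) -> no   (0.75*24 + 0.25*60 = 33)
   # (64, 80) -> yes  (68)
   # (60, 96) -> yes  (69)
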
Learning Linear Models

 Stochastic Gradient Descent (SGD)
 • Main idea: start with default weights
 • For each row, check if the current weights predict correctly
 • If misclassified: adjust the weights

 How to adjust the weights?
 • If positive class: add the row
 • If negative class: subtract the row

 (Flowchart: start with default weights -> read the next training row -> do the weights
 predict the correct label? If yes, read the next row; if no, adjust the weights first.)




Learning Linear Models

 repeat
   row = readNextRow();
   if (predict(weights, row.attributes) != row.class)
     weights += row.class * row.attributes;
     threshold += -row.class;
   endif
 end

 Worked example, starting from Weight 1 = 1, Weight 2 = -1, threshold = 0:

 Step 1: row (Age 24, Income 60, BuysBook +1)
   1 • 24 + -1 • 60 = -36, which is not above the threshold 0, so the model predicts -1.
   The true class is +1, so the row is misclassified: add the row to the weights and
   decrease the threshold.
   New weights: 25 and 59, new threshold: -1.
   Check: 25 • 24 + 59 • 60 = 4140 > -1, so this row is now correctly predicted as +1.

 Step 2: row (Age 30, Income 30, BuysBook -1)
   25 • 30 + 59 • 30 = 2520 > -1, so the model predicts +1.
   The true class is -1, so the row is misclassified: subtract the row from the weights
   and increase the threshold.
   New weights: -5 and 29, new threshold: 0.
   Check: -5 • 30 + 29 • 30 = 720 > 0, so this row is still predicted as +1; later rows
   and passes keep adjusting the weights.



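 The pseudocode above can be turned into a short, runnable Python sketch (an
 illustration, not the implementation used in the talk). The starting weights,
 threshold, and data rows mirror the worked example, and classes are encoded as +1/-1.

   # Perceptron-style SGD: adjust the weights on every misclassified row.
   def predict(weights, threshold, attributes):
       score = sum(w * x for w, x in zip(weights, attributes))
       return 1 if score > threshold else -1

   def sgd_pass(rows, weights, threshold):
       """One sweep over the data, updating weights and threshold on mistakes."""
       for attributes, label in rows:
           if predict(weights, threshold, attributes) != label:
               weights = [w + label * x for w, x in zip(weights, attributes)]
               threshold += -label
       return weights, threshold

   rows = [([24, 60], +1), ([30, 30], -1)]
   weights, threshold = sgd_pass(rows, [1, -1], 0)
   print(weights, threshold)   # [-5, 29] 0, matching the walkthrough above
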
Learning - Convergence

 With full-size updates the weights can keep jumping around and never settle:

 repeat
   row = readNextRow();
   if (predict(weights, row.attributes) != row.class)
     weights += row.class * row.attributes;
     threshold += -row.class;
   endif
 end

 A small fixed step size makes each update less drastic:

 repeat
   row = readNextRow();
   if (predict(weights, row.attributes) != row.class)
     weights += 0.001 * row.class * row.attributes;
     threshold += -row.class;
   endif
 end

 A decreasing step size ensures convergence:

 for i = 1 to ∞
   row = readNextRow();
   if (predict(weights, row.attributes) != row.class)
     weights += (1/i) * row.class * row.attributes;
     threshold += -row.class;
   endif
 end




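 The decreasing step size is a small change to the earlier sketch (it reuses the
 predict helper defined there); the threshold update is left unscaled, as in the
 pseudocode above.

   # Decreasing-step-size variant: scale each weight update by 1/i.
   def sgd_decaying(rows, weights, threshold, passes=10):
       i = 0
       for _ in range(passes):           # the slides loop forever; we cap the passes here
           for attributes, label in rows:
               i += 1
               if predict(weights, threshold, attributes) != label:
                   step = 1.0 / i
                   weights = [w + step * label * x for w, x in zip(weights, attributes)]
                   threshold += -label   # unscaled, as on the slide
       return weights, threshold
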
Learning - Margin

 Update not only on outright mistakes, but whenever a row is classified with a margin
 of at most 1:

 for i = 1 to ∞
   row = readNextRow();
   if (margin(weights, row.attributes, threshold) <= 1)
     weights += (1/n) * row.class * row.attributes;
     threshold += -row.class;
   endif
 end

 Example: weights 0.5 and 0.25, threshold 26.5, row (Age 24, Income 60, BuysBook +1):

   0.5 • 24 + 0.25 • 60 = 27

 The score 27 is above the threshold 26.5, so the prediction +1 is correct, but only by
 a small margin, so the row still triggers a weight update.




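 The slides do not spell out the margin formula; a common choice (assumed here) is the
 label times the distance of the score from the threshold. A sketch in Python:

   # Margin-based update: adjust the weights whenever a row is classified correctly
   # but not confidently (margin <= 1), not only on outright mistakes.
   def margin(weights, threshold, attributes, label):
       score = sum(w * x for w, x in zip(weights, attributes))
       return label * (score - threshold)

   def sgd_margin_pass(rows, weights, threshold, step):
       for attributes, label in rows:
           if margin(weights, threshold, attributes, label) <= 1:
               weights = [w + step * label * x for w, x in zip(weights, attributes)]
               threshold += -label
       return weights, threshold

   # Example from the slide: weights (0.5, 0.25), threshold 26.5, row (24, 60), label +1.
   # Score = 27, margin = +1 * (27 - 26.5) = 0.5 <= 1, so the row still triggers an update.
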
Learning - Regularization

 Attributes are often correlated
 • Contributions cancel out
 • This leads to unreasonably large weights...
 • ... and models which are not robust to noise

 Regularization
 • Make sure the weights don't get too large
 • L2 regularization: weights are proportional to attribute quality

 Both of the following models score the row (Age 24, Income 60, BuysBook +1) the same
 way, but the second one relies on huge, nearly cancelling weights:

   Weights 0.5 and 0.5, threshold 30:       0.5 • 24 + 0.5 • 60     = 42   -> +1
   Weights 1000 and -399.3, threshold 30:   1000 • 24 + -399.3 • 60 = 42   -> +1

 Regularized update: shrink the weights a little after every step

 for i = 1 to ∞
   row = readNextRow();
   if (margin(weights, row.attributes, threshold) <= 1)
     weights += (1/n) * row.class * row.attributes;
     threshold += -row.class;
   endif
   weights = i/(i+r) * weights;
 end




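 As a sketch (reusing the margin helper from the previous example), the regularized
 loop only adds a shrinking step after each row; r is the regularization strength.

   # L2-style regularization: after each row, pull all weights toward zero by i/(i+r)
   # so that they cannot grow without bound.
   def sgd_regularized_pass(rows, weights, threshold, step, r):
       i = 0
       for attributes, label in rows:
           i += 1
           if margin(weights, threshold, attributes, label) <= 1:
               weights = [w + step * label * x for w, x in zip(weights, attributes)]
               threshold += -label
           shrink = i / (i + r)
           weights = [shrink * w for w in weights]
       return weights, threshold
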
Implementation on Hadoop

 Map-Reduce
 • Input data must be in random order
 • Mapper: send the data to the reducer in random order
 • Reducer: run the actual Stochastic Gradient Descent (a sketch of both follows below)

 Evaluation and Parameter Selection
 • Perform several runs with varying parameters
 • Learn on the training set, evaluate on the test set
 • Many runs with partial data are often better than one run with all the data




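 A minimal Hadoop Streaming version of this split (a sketch under the assumption that
 rows arrive as CSV lines "age,income,label" with label +1 or -1; the slides do not
 prescribe a particular file format or framework API):

   # ---- mapper.py: emit each row under a random key so the reducer sees the data
   # ---- in (roughly) random order.
   import random
   import sys

   for line in sys.stdin:
       line = line.strip()
       if line:
           print("%06d\t%s" % (random.randint(0, 999999), line))

   # ---- reducer.py: run the SGD loop over the shuffled rows and emit the model.
   import sys

   def predict(weights, threshold, attributes):
       score = sum(w * x for w, x in zip(weights, attributes))
       return 1 if score > threshold else -1

   weights, threshold = None, 0.0
   for line in sys.stdin:
       _, row = line.rstrip("\n").split("\t", 1)
       *attributes, label = [float(v) for v in row.split(",")]
       if weights is None:
           weights = [0.0] * len(attributes)
       if predict(weights, threshold, attributes) != label:
           weights = [w + label * x for w, x in zip(weights, attributes)]
           threshold += -label
   if weights is not None:
       print(",".join(str(w) for w in weights) + "\t" + str(threshold))

 Such scripts would typically be launched with the hadoop-streaming jar
 (hadoop jar hadoop-streaming.jar -input ... -output ... -mapper mapper.py
 -reducer reducer.py), with the exact jar path depending on the distribution.
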
Demo



Learning Linear Models

 Stochastic Gradient Descent: Pros and Cons
 • One sweep over the data: easy to implement on top of Hadoop
 • Flexible: support vector machines, logistic regression, etc.
 • Provides a good-enough estimate instead of the exact optimum
 • Parameter selection and evaluation are crucial

 Alternative: convex optimization
 • Formulate learning as a numerical optimization problem
 • On Hadoop: usually L-BFGS
 • See Vowpal Wabbit for a large-scale implementation


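 To illustrate the "flexible" point, the same skeleton can take a gradient step on the
 logistic loss instead of the perceptron update, which yields logistic regression (a
 sketch; here the model is written with an additive bias rather than a threshold):

   import math

   # SGD step for the logistic loss log(1 + exp(-label * score)), labels in {+1, -1}.
   def sgd_logistic_pass(rows, weights, bias, step):
       for attributes, label in rows:
           score = sum(w * x for w, x in zip(weights, attributes)) + bias
           z = max(-30.0, min(30.0, label * score))      # clamp to avoid overflow in exp
           g = -label / (1.0 + math.exp(z))              # gradient of the loss w.r.t. score
           weights = [w - step * g * x for w, x in zip(weights, attributes)]
           bias -= step * g
       return weights, bias
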
Conclusion

 Linear Models
 • Prediction based on a weighted vote and a threshold

 Stochastic Gradient Descent
 • Adjust the weight vector iteratively for each misclassified row
 • Decreasing step size to ensure convergence
 • Margins and regularization for robustness

 Implementation
 • Mapper provides random order, reducer performs SGD
 • Evaluation and parameter selection are crucial

Thanks
                           urueckert@datameer.com





  • 8. Linear Models Expert1 Expert2 BuysBook 24 60 ? 64 80 ? Assign a weight to each 60 96 ? expert • Expert is mostly correct: large Weight 1 Weight 2 Threshold weight -0.5 0.25 -8 • Expert is uninformative: zero • Expert is consistently wrong: Expert 1 Expert 2 > threshold negative weight -0.5 • 24 + 0.25 • 60 = 3 yes -0.5 • 64 + 0.25 • 80 = -12 no -0.5 • 60 + 0.25 • 96 = -6 yes Learning models • A linear model contains weights and threshold Expert1 Expert2 Prediction 24 60 yes • Learn by finding weights with 64 80 no lowest error on training data 60 96 yes © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
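The weighted vote on slides 6 to 8 is easy to express in code. Below is a minimal Python sketch of that prediction rule, using the weights and threshold from slide 6 as example values; it is an illustration added here, not code from the talk.

# Minimal sketch of the weighted-vote prediction (slides 6-8); example values only.
def predict(weights, threshold, attributes):
    # Weighted sum of the attribute values ("expert ratings").
    score = sum(w * x for w, x in zip(weights, attributes))
    # Score above the threshold means "yes", otherwise "no".
    return "yes" if score > threshold else "no"

weights = [0.75, 0.25]    # one weight per expert (values from slide 6)
threshold = 48
for row in [(24, 60), (64, 80), (60, 96)]:
    print(row, predict(weights, threshold, row))    # no, yes, yes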
  • 9. Learning Linear Models Stochastic Gradient Descent (SGD) • Main idea: start with default weights • For each row, check if the current weights predict the correct label • If misclassified: adjust the weights • How to adjust the weights: if positive class, add the row; if negative class, subtract the row (Flowchart: start with default weights, read the next training row, check whether the weights predict the correct label, adjust the weights if not, repeat) © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 10. Learning Linear Models repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += row.class * row.attributes; threshold += -row.class; endif end Weight 1 Weight 2 Threshold 1 -1 0 Age Income > threshold 1•? + -1 • ? = ? ? Age Income BuysBook 24 60 +1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 11. Learning Linear Models repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += row.class * row.attributes; threshold += -row.class; endif end Weight 1 Weight 2 Threshold 1 -1 0 Age Income > threshold 1 • 24 + -1 • 60 = -36 -1 Age Income BuysBook 24 60 +1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 12. Learning Linear Models repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += row.class * row.attributes; threshold += -row.class; endif end Weight 1 Weight 2 Threshold 25 59 0 Age Income > threshold 25 • 24 + 59 • 60 = 4140 +1 Age Income BuysBook 24 60 +1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 13. Learning Linear Models repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += row.class * row.attributes; threshold += -row.class; endif end Weight 1 Weight 2 Threshold 25 59 -1 Age Income > threshold 25 • 24 + 59 • 60 = 4140 +1 Age Income BuysBook 24 60 +1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 14. Learning Linear Models repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += row.class * row.attributes; threshold += -row.class; endif end Weight 1 Weight 2 Threshold 25 59 -1 Age Income > threshold 25 • ? + 59 • ? = ? ? Age Income BuysBook 30 30 -1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 15. Learning Linear Models repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += row.class * row.attributes; threshold += -row.class; endif end Weight 1 Weight 2 Threshold 25 59 -1 Age Income > threshold 25 • 30 + 59 • 30 = 2520 +1 Age Income BuysBook 30 30 -1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 16. Learning Linear Models repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += row.class * row.attributes; threshold += -row.class; endif end Weight 1 Weight 2 Threshold -5 29 -1 Age Income > threshold -5 • 30 + 29 • 30 = 720 +1 Age Income BuysBook 30 30 -1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 17. Learning Linear Models repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += row.class * row.attributes; threshold += -row.class; endif end Weight 1 Weight 2 Threshold -5 29 0 Age Income > threshold -5 • 30 + 29 • 30 = 720 +1 Age Income BuysBook 30 30 -1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
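The update loop traced on slides 10 to 17 can be turned into a short runnable program. The sketch below follows the slide pseudocode directly (add the row to the weights on a misclassified positive example, subtract it on a misclassified negative one); the fixed number of sweeps and the two-row data set are assumptions made for illustration.

# Perceptron-style SGD following the pseudocode on slides 10-17 (illustrative sketch).
def predict(weights, threshold, x):
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > threshold else -1

def train(rows, sweeps=10):
    weights = [1.0, -1.0]    # starting weights as on slide 10
    threshold = 0.0
    for _ in range(sweeps):
        for x, label in rows:                       # label is +1 or -1
            if predict(weights, threshold, x) != label:
                # Misclassified: move the weights toward (positive class)
                # or away from (negative class) the row.
                weights = [w + label * xi for w, xi in zip(weights, x)]
                threshold += -label
    return weights, threshold

rows = [((24, 60), +1), ((30, 30), -1)]             # the two rows used in the trace
print(train(rows))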
  • 18. Learning - Convergence repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += row.class * row.attributes; threshold += -row.class; endif end © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 19. Learning - Convergence repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += row.class * row.attributes; threshold += -row.class; endif end © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 20. Learning - Convergence repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += 0.001 * row.class * row.attributes; threshold += -row.class; endif end © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 21. Learning - Convergence repeat row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += 0.001 * row.class * row.attributes; threshold += -row.class; endif end © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 22. Learning - Convergence for i=1 to ∞ row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += (1/i) * row.class * row.attributes; threshold += -row.class; endif end © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 23. Learning - Convergence for i=1 to ∞ row = readNextRow(); if(predict(weights, row.attributes) != row.class) weights += (1/i) * row.class * row.attributes; threshold += -row.class; endif end © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
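Slides 20 to 23 change only the step size of that update: first a small constant factor, then a factor 1/i that shrinks as more rows are processed, so later corrections get smaller and the weights settle. A sketch of the decreasing-step variant follows; the zero starting weights and the finite number of updates are assumptions, since the slides loop indefinitely.

# SGD with a decreasing step size 1/i, as on slides 22-23 (illustrative sketch).
def train_decaying(rows, max_updates=10000):
    weights = [0.0, 0.0]     # starting point assumed; the slides do not specify it here
    threshold = 0.0
    for i in range(1, max_updates + 1):
        x, label = rows[(i - 1) % len(rows)]        # cycle through the training rows
        score = sum(w * xi for w, xi in zip(weights, x))
        if (1 if score > threshold else -1) != label:
            step = 1.0 / i                          # step size shrinks over time
            weights = [w + step * label * xi for w, xi in zip(weights, x)]
            threshold += -label                     # the slides keep the full step for the threshold
    return weights, threshold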
  • 24. Learning - Margin for i = 1 to ∞ row = readNextRow(); if(margin(weights, row.attributes, threshold) <= 1) weights += (1/n) * row.class * row.attributes; threshold += -row.class; endif end Weight 1 Weight 2 Threshold 0.5 0.25 26.5 Age Income Margin > threshold 0.5 • 24 + 0.25 • 60 = 27 +1 Age Income BuysBook 24 60 +1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 25. Learning - Margin for i = 1 to ∞ row = readNextRow(); if(margin(weights, row.attributes, threshold) <= 1) weights += (1/n) * row.class * row.attributes; threshold += -row.class; endif end Weight 1 Weight 2 Threshold 0.5 0.25 26.5 Age Income Margin > threshold 0.5 • 24 + 0.25 • 60 = 27 +1 Age Income BuysBook 24 60 +1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 26. Learning - Margin for i = 1 to ∞ row = readNextRow(); if(margin(weights, row.attributes, threshold) <= 1) weights += (1/n) * row.class * row.attributes; threshold += -row.class; endif end Weight 1 Weight 2 Threshold 0.5 0.25 26.5 Age Income Margin > threshold 0.5 • 24 + 0.25 • 60 = 27 +1 Age Income BuysBook 24 60 +1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
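Slides 24 to 26 swap the misclassification test for a margin test: the weights are also adjusted when a row is classified correctly but only barely, which makes the final model more robust. A small sketch of that margin, assuming it is defined as the class times the difference between the score and the threshold:

# Margin from slides 24-26 (sketch; the exact margin definition is an assumption).
def margin(weights, threshold, x, label):
    score = sum(w * xi for w, xi in zip(weights, x))
    return label * (score - threshold)    # large and positive when confidently correct

# Slide 24's numbers: the score 27 beats the threshold 26.5, so the prediction is
# correct, but the margin 0.5 is <= 1 and the row would still trigger an update.
print(margin([0.5, 0.25], 26.5, (24, 60), +1))    # 0.5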
  • 27. Learning - Regularization for i = 1 to ∞ Attributes are often row = readNextRow(); if(margin(weights, row.attributes, threshold) <= 1) correlated weights += (1/n) * row.class * row.attributes; threshold += -row.class; • Contributions cancel out endif end • This leads to unreasonably large weights... • ... and models which are not Weight 1 Weight 2 Threshold robust to noise 0.5 0.5 30 Regularization Age Income > threshold • Make sure weights donʼt get too 0.5 • 24 + 0.5 • 60 = 42 +1 large • L2 regularization: weights are Age Income BuysBook proportional to attribute quality 24 60 +1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 28. Learning - Regularization for i = 1 to ∞ Attributes are often row = readNextRow(); if(margin(weights, row.attributes, threshold) <= 1) correlated weights += (1/n) * row.class * row.attributes; threshold += -row.class; • Contributions cancel out endif end • This leads to unreasonably large weights... • ... and models which are not Weight 1 Weight 2 Threshold robust to noise 1000 -399.3 30 Regularization Age Income > threshold • Make sure weights donʼt get too 1000 • 24 + -399.3 • 60 = 42 +1 large • L2 regularization: weights are Age Income BuysBook proportional to attribute quality 24 60 +1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 29. Learning - Regularization for i = 1 to ∞ Attributes are often row = readNextRow(); if(margin(weights, row.attributes, threshold) <= 1) correlated weights += (1/n) * row.class * row.attributes; threshold += -row.class; • Contributions cancel out endif • This leads to unreasonably end weights = i/(i+r) * weights; large weights... • ... and models which are not Weight 1 Weight 2 Threshold robust to noise 1000 -399.3 30 Regularization Age Income > threshold • Make sure weights donʼt get too 1000 • 24 + -399.3 • 60 = 42 +1 large • L2 regularization: weights are Age Income BuysBook proportional to attribute quality 24 60 +1 © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
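Slide 29 adds one more line to the loop: the weights are multiplied by i/(i+r), which pulls them back toward zero and prevents the runaway values shown on slide 28. The transcript does not make completely clear whether this shrinkage happens on every iteration or once at the end; the sketch below applies it on every iteration, which is the usual SGD formulation, and treats the regularization parameter r as a user-chosen value.

# Margin update plus L2-style shrinkage, combining slides 24-29 (illustrative sketch).
def train_regularized(rows, r=1.0, max_updates=10000):
    weights = [0.0, 0.0]
    threshold = 0.0
    n = len(rows)
    for i in range(1, max_updates + 1):
        x, label = rows[(i - 1) % n]
        score = sum(w * xi for w, xi in zip(weights, x))
        if label * (score - threshold) <= 1:          # margin test from slide 24
            step = 1.0 / n                            # the slides use a 1/n step here
            weights = [w + step * label * xi for w, xi in zip(weights, x)]
            threshold += -label
        weights = [i / (i + r) * w for w in weights]  # shrinkage step from slide 29
    return weights, threshold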
  • 30. Implementation on Hadoop Map-Reduce • Input data must be in random order • Mapper: send data to the reducer in random order • Reducer: run the actual Stochastic Gradient Descent Evaluation and Parameter Selection • Perform several runs with varying parameters • Learn on the training set, evaluate on the test set • Many runs with partial data are often better than one run with all data © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
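One way to realize slide 30's mapper/reducer split is with two small Hadoop Streaming scripts: the mapper tags every row with a random key so the shuffle delivers rows to the reducer in roughly random order, and the reducer runs the SGD loop over whatever it receives. The file names, the comma-separated field layout, and the single-reducer setup below are illustrative guesses, not the implementation demoed in the talk; the scripts would be passed to the standard Hadoop Streaming jar via its -mapper and -reducer options.

# mapper.py: tag each input row with a random key so the shuffle randomizes the order
# (illustrative Hadoop Streaming sketch; the row format is an assumption).
import random
import sys

for line in sys.stdin:
    line = line.strip()
    if line:
        print("%06d\t%s" % (random.randint(0, 999999), line))

# reducer.py (a separate script): receive rows in shuffled order and run one SGD sweep.
import sys

weights, threshold = None, 0.0
for line in sys.stdin:
    _, row = line.rstrip("\n").split("\t", 1)
    fields = row.split(",")                        # assumed layout: attribute values, then the label
    x = [float(v) for v in fields[:-1]]
    label = 1 if fields[-1] == "yes" else -1
    if weights is None:
        weights = [0.0] * len(x)                   # initialize once the number of attributes is known
    score = sum(w * xi for w, xi in zip(weights, x))
    if (1 if score > threshold else -1) != label:
        weights = [w + label * xi for w, xi in zip(weights, x)]
        threshold += -label
print("\t".join(str(w) for w in weights) + "\t" + str(threshold))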
  • 31. Demo © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 32. Learning Linear Models Stochastic Gradient Descent: Pros and Cons • One sweep over the data: easy to implement on top of Hadoop • Flexible: supports support vector machines, logistic regression, etc. • Provides a good-enough estimate rather than the exact optimum • Parameter selection and evaluation are crucial Alternative: convex optimization • Formulate learning as a numerical optimization problem • On Hadoop: usually L-BFGS • See Vowpal Wabbit for a large-scale implementation © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
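The convex-optimization alternative on slide 32 treats learning as minimizing a loss function over the whole training set. As a single-machine illustration of that view (not the distributed Hadoop/L-BFGS setup the slide refers to), a logistic-regression loss can be handed to an off-the-shelf L-BFGS optimizer; the tiny data set and the regularization constant below are made up.

# Single-machine sketch of the convex-optimization view from slide 32 (not the Hadoop version).
import numpy as np
from scipy.optimize import minimize

X = np.array([[24.0, 60.0], [30.0, 30.0], [64.0, 80.0], [20.0, 25.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])              # made-up labels for illustration

def logistic_loss(w):
    # The last entry of w plays the role of the threshold (a bias term).
    scores = X @ w[:-1] - w[-1]
    data_term = np.mean(np.logaddexp(0.0, -y * scores))   # numerically stable log(1 + exp(.))
    return data_term + 0.01 * np.dot(w[:-1], w[:-1])      # small L2 regularization

result = minimize(logistic_loss, np.zeros(X.shape[1] + 1), method="L-BFGS-B")
print(result.x)                                    # learned weights and threshold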
  • 33. Conclusion Linear Models • Prediction based on weighted vote and threshold Stochastic Gradient Descent • Adjust weight vector iteratively for each misclassified row • Decreasing step size to ensure convergence • Margins and regularization for robustness Implementation • Mapper provides random order, reducer performs SGD • Evaluation and parameter selection are crucial © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013
  • 34. Thanks urueckert@datameer.com © 2012 Datameer, Inc. All rights reserved. Thursday, March 28, 2013