SlideShare a Scribd company logo
Scaling Big Data
Mining Infrastructure


            Jimmy Lin   Dmitriy Ryaboy
            @lintool    @squarecog

            Hadoop Summit Europe
            Thursday, March 21, 2013
From the Ivory Tower…

Source: Wikipedia (All Souls College, Oxford)
… to building sh*t that works.
Source: Wikipedia (Factory)
IMHO
     Represents personal opinion. Not official position of Twitter.
Management not responsible for misuse. Void where prohibited. YMMV.




              (If someone asks, I probably wasn’t here)
“Yesterday”



        ~150 people total
       ~60 Hadoop nodes
~6 people use analytics stack daily
“Today”


           ~1400 people total
10s of Ks of Hadoop nodes, multiple DCs
 10s of PBs total Hadoop DW capacity
          ~100 TB ingest daily
   dozens of teams use Hadoop daily
     10s of Ks of Hadoop jobs daily
Why?


         actionable insights
data     data products
Big data mining
Cool, you get to work on new algorithms!



        No, not really…
Big data mining isn’t mainly
 about data mining per se!
It’s impossible to overstress this: 80%
                                of the work in any data project is in
                                 cleaning the data. – DJ Patil ―Data
                                               Jujitsu‖




Source: Wikipedia (Jujitsu)
Reality
 Your boss says something vague
 You think very hard on how to move the needle
 Where’s the data?
 What’s in this dataset?
 What’s all the f#$!* crap in the data?
 Clean the data
 Run some off-the-shelf data mining algorithm
 …
 Productionize, act on the insight
 Rinse, repeat
Data science is less
     glamorous that you think!


How do we make data scientists’ lives a bit easier?
Gathering
Source: Wikipedia (Logging)
Moving




Source: Wikipedia (Timber rafting)
Organizing




Source: Wikipedia (Logging)
Log directly into a database!

Source: http://www.flickr.com/photos/snukkel/3206405352/
create table `my_audit_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `created_at` datetime,
  `user_id` int(11),
  `action` varchar(256),
  ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


           Don’t do this!
             —   Workload mismatch
             —   Scaling challenges
             —   Overkill
             —   Schema changes
Main Datacenter


                                Scribe
                              Aggregators



                                                HDFS                      Main Hadoop
                                                                              DW

                                   Staging Hadoop Cluster
Datacenter
         Scribe Daemons                     Datacenter
        (Production Hosts)
                           Scribe
                         Aggregators                                   Scribe
                                                                     Aggregators

                                            HDFS
                                                                                      HDFS

                              Staging Hadoop Cluster
                                                                          Staging Hadoop Cluster
     Scribe Daemons
    (Production Hosts)                           Scribe Daemons
                                                (Production Hosts)




                             Use Scribe. or Flume.or Kafka.
Scribe solves log transport only…




        System.out.println
             LOG.info
^(w+s+d+s+d+:d+:d+)s+
([^@]+?)@(S+)s+(S+):s+(S+)s+(S+)
s+((?:S+?,s+)*(?:S+?))s+(S+)s+(S+)
s+[([^]]+)]s+"(w+)s+([^"]*
(?:.[^"]*)*)s+(S+)"s+(S+)s+
(S+)s+"([^"]*(?:.[^"]*)*)
"s+"([^"]*(?:.[^"]*)*)"s*
(d*-[d-]*)?s*(d+)?s*(d*.[d.]*)?
(s+[-w]+)?.*$

  An actual Java regular expression used to
  parse log message at Twitter circa 2010

     Plain-text log messages suck
             Don’t do this!
userid
CamelCase

smallCamelCase    user_id

snake_case

camel_Snake

dunder__snake




             Naming things is hard!
JSON to the Rescue!

Source: http://www.flickr.com/photos/snukkel/3206405352/
This should really be a list…
Remember the camelSnake!
        {
            "token": 945842,
            "feature_enabled": "super_special",
            "userid": 229922,
            "page": "null",             Is this really an integer?
            "info": { "email": "my@place.com" }
        }


                         Is this really null?

    What keys? What values?


                 This does not scale.
struct MessageInfo {
   1: optional string name
   2: optional string email
 }

struct LogMessage {
   1: required i64 token
   2: required string user_id
   3: optional list<Feature> enabled_features
   4: optional i64 page = 0
   5: optional MessageInfo info
 }
                              + DDL provides type safety
 enum Feature {
   super_special,             + Auto codgen
   less_special
 }                            + Efficient serialization

                              +   Sane schema migration
                              +   Separate logical from
                                  physical
            Use Thrift. or Protobufs.or Avro.
Schemas aren’t enough!


We need a data discovery service!
          Where’s the data?
          How do I read it?
          Who produces it?
         Who consumes it?
      When was it last generated?
                  …
Where to find data?

 Old way:
                Hard-coded partitioning scheme, path, format
  A = LOAD ‘/tables/statuses/2011/01/{05,06,07}/*.lzo’
      USING LzoStatusProtobufBlockPigLoader();
                                          Custom loader

 How do people know?            1.) Ask around 2.) Cargo-cult

 New way:
                   Nice UI for browsing
  A = LOAD ‘tables.statuses’
      USING TwadoopLoader(); Same loader each time

  B = FILTER A BY year == ‘2011’
      AND month == ‘11’
      AND day == ‘01’
      AND hour >= ’05’
      AND hour <= ‘07’;
                         Filters are pushed into the loader.
                            No need to understand partitioning
                            scheme.
Data Access Layer (DAL)
―All problems in computer science can be solved by another level
of indirection... Except for the problem of too many layers of
indirection.‖      – David Wheeler


        All data accesses go through DAL
           Thin layer on top of HCatalog
Data Access Layer (DAL)
          Who wrote what data, when?


                      #win
Automatically construct data/job dependency graph
        Automatically figure out ownership

       Hooks into alerting, auditing, repos,
              deploy systems, etc.
Plumbing
                               Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter.
                               SIGMOD 2012.
Source: Wikipedia (Plumbing)
Classification




Source: Wikipedia (Sorting)
label
Given
                    feature vector



                                       ?   features


    features
     features
      features      learner                    classifier
       features
        features



Induce                                            ?   ?




                      Training       Predicting
Stone age machine learning…




Source: Wikipedia (Stonehenge)
upload results


data munging
 Joining multiple dataset
 Feature extraction
 …                                  download
                      down-sample   test data



                            train    predict
What doesn’t work…
 1.   Down-sampling for training on single-processor
         Defeats the whole point of big data!
 2.   Ad hoc productionizing
         Disconnected from rest of production workflow
Production considerations:
    dependency management
           scheduling
       resource allocation
           monitoring
         error reporting
             alerting
                …


        We need…
Seamless scaling




Source: Wikipedia (Galaxy)
Integration with production workflows




Source: Wikipedia (Oil refinery)
Stochastic Gradient Descent
             Conceptually, classifier training is a like
               user-defined aggregate function!


             AVG                     SGD

initialize   sum = 0                 initialize
             count = 0               weights

 update      add to sum
             increment
             count
terminate    return sum / count      return weights
previous Pig dataflow                previous Pig dataflow



             map

Classifier
Training
                    reduce
                                label, feature vector
             Pig storage
                function

                           model                           model     model



                                    feature vector                 feature vector
  Making           model        UDF              model         UDF
Predictions                         prediction                     prediction
                             Just like any other parallel Pig dataflow
It’s just Pig!




                 For ―free‖: dependency management,
                 scheduling, resource allocation,
                 monitoring, error reporting, alerting, …
Source: Wikipedia (Road surface)
Takeaway messages
How do we make data scientists’ lives a bit easier?




     Adding a bit of structure goes a long way
Getting the plumbing right makes all the difference




 ―In theory, there is no difference between
 theory and practice. But, in practice, there
 is.‖
                                                Questions?
                 - Jan L.A. van de Snepscheut

More Related Content

What's hot

Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
Databricks
 
Wizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in AzureWizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in Azure
Databricks
 
Mixing Analytic Workloads with Greenplum and Apache Spark
Mixing Analytic Workloads with Greenplum and Apache SparkMixing Analytic Workloads with Greenplum and Apache Spark
Mixing Analytic Workloads with Greenplum and Apache Spark
VMware Tanzu
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
Saurav Haloi
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
Databricks
 
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
SANG WON PARK
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra
 
Graph Databases for Master Data Management
Graph Databases for Master Data ManagementGraph Databases for Master Data Management
Graph Databases for Master Data Management
Neo4j
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)
Eva Tse
 
FIBO in Neo4j: Applying Knowledge Graphs in the Financial Industry
FIBO in Neo4j: Applying Knowledge Graphs in the Financial IndustryFIBO in Neo4j: Applying Knowledge Graphs in the Financial Industry
FIBO in Neo4j: Applying Knowledge Graphs in the Financial Industry
Neo4j
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
Taras Matyashovsky
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
DataWorks Summit
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
John Yeung
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
Databricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
Dmitry Anoshin
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 

What's hot (20)

Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Wizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in AzureWizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in Azure
 
Mixing Analytic Workloads with Greenplum and Apache Spark
Mixing Analytic Workloads with Greenplum and Apache SparkMixing Analytic Workloads with Greenplum and Apache Spark
Mixing Analytic Workloads with Greenplum and Apache Spark
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Graph Databases for Master Data Management
Graph Databases for Master Data ManagementGraph Databases for Master Data Management
Graph Databases for Master Data Management
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)
 
FIBO in Neo4j: Applying Knowledge Graphs in the Financial Industry
FIBO in Neo4j: Applying Knowledge Graphs in the Financial IndustryFIBO in Neo4j: Applying Knowledge Graphs in the Financial Industry
FIBO in Neo4j: Applying Knowledge Graphs in the Financial Industry
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 

Viewers also liked

Composing and scaling data platforms
Composing and scaling data platformsComposing and scaling data platforms
Composing and scaling data platforms
Sigmoid
 
To Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to NameTo Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to Name
Mark Matienzo
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
DataWorks Summit
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
DataWorks Summit
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Hortonworks
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterBill Graham
 

Viewers also liked (6)

Composing and scaling data platforms
Composing and scaling data platformsComposing and scaling data platforms
Composing and scaling data platforms
 
To Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to NameTo Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to Name
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
 

Similar to Scaling Big Data Mining Infrastructure Twitter Experience

Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
Cloudera, Inc.
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
Cloudera, Inc.
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
Gert Drapers
 
SD, a P2P bug tracking system
SD, a P2P bug tracking systemSD, a P2P bug tracking system
SD, a P2P bug tracking systemJesse Vincent
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BI
Denny Lee
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Jetlore
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
Jared Winick
 
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics SystemFour Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
Treasure Data, Inc.
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
fvanvollenhoven
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Chris Baglieri
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Richard McDougall
 
Puppet for Sys Admins
Puppet for Sys AdminsPuppet for Sys Admins
Puppet for Sys Admins
Puppet
 
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
Zhenzhong Xu
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
cwensel
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
outstanding59
 

Similar to Scaling Big Data Mining Infrastructure Twitter Experience (20)

Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
SD, a P2P bug tracking system
SD, a P2P bug tracking systemSD, a P2P bug tracking system
SD, a P2P bug tracking system
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BI
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
 
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics SystemFour Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Puppet for Sys Admins
Puppet for Sys AdminsPuppet for Sys Admins
Puppet for Sys Admins
 
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Django at Scale
Django at ScaleDjango at Scale
Django at Scale
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 

Recently uploaded (20)

The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 

Scaling Big Data Mining Infrastructure Twitter Experience

  • 1. Scaling Big Data Mining Infrastructure Jimmy Lin Dmitriy Ryaboy @lintool @squarecog Hadoop Summit Europe Thursday, March 21, 2013
  • 2. From the Ivory Tower… Source: Wikipedia (All Souls College, Oxford)
  • 3. … to building sh*t that works. Source: Wikipedia (Factory)
  • 4. IMHO Represents personal opinion. Not official position of Twitter. Management not responsible for misuse. Void where prohibited. YMMV. (If someone asks, I probably wasn’t here)
  • 5.
  • 6. “Yesterday” ~150 people total ~60 Hadoop nodes ~6 people use analytics stack daily
  • 7. “Today” ~1400 people total 10s of Ks of Hadoop nodes, multiple DCs 10s of PBs total Hadoop DW capacity ~100 TB ingest daily dozens of teams use Hadoop daily 10s of Ks of Hadoop jobs daily
  • 8. Why? actionable insights data data products
  • 9. Big data mining Cool, you get to work on new algorithms! No, not really…
  • 10. Big data mining isn’t mainly about data mining per se!
  • 11. It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data. – DJ Patil ―Data Jujitsu‖ Source: Wikipedia (Jujitsu)
  • 12. Reality Your boss says something vague You think very hard on how to move the needle Where’s the data? What’s in this dataset? What’s all the f#$!* crap in the data? Clean the data Run some off-the-shelf data mining algorithm … Productionize, act on the insight Rinse, repeat
  • 13. Data science is less glamorous that you think! How do we make data scientists’ lives a bit easier?
  • 14.
  • 18. Log directly into a database! Source: http://www.flickr.com/photos/snukkel/3206405352/
  • 19. create table `my_audit_log` ( `id` int(11) NOT NULL AUTO_INCREMENT, `created_at` datetime, `user_id` int(11), `action` varchar(256), ... ) ENGINE=InnoDB DEFAULT CHARSET=utf8; Don’t do this! — Workload mismatch — Scaling challenges — Overkill — Schema changes
  • 20. Main Datacenter Scribe Aggregators HDFS Main Hadoop DW Staging Hadoop Cluster Datacenter Scribe Daemons Datacenter (Production Hosts) Scribe Aggregators Scribe Aggregators HDFS HDFS Staging Hadoop Cluster Staging Hadoop Cluster Scribe Daemons (Production Hosts) Scribe Daemons (Production Hosts) Use Scribe. or Flume.or Kafka.
  • 21. Scribe solves log transport only… System.out.println LOG.info
  • 23. userid CamelCase smallCamelCase user_id snake_case camel_Snake dunder__snake Naming things is hard!
  • 24. JSON to the Rescue! Source: http://www.flickr.com/photos/snukkel/3206405352/
  • 25. This should really be a list… Remember the camelSnake! { "token": 945842, "feature_enabled": "super_special", "userid": 229922, "page": "null", Is this really an integer? "info": { "email": "my@place.com" } } Is this really null? What keys? What values? This does not scale.
  • 26. struct MessageInfo { 1: optional string name 2: optional string email } struct LogMessage { 1: required i64 token 2: required string user_id 3: optional list<Feature> enabled_features 4: optional i64 page = 0 5: optional MessageInfo info } + DDL provides type safety enum Feature { super_special, + Auto codgen less_special } + Efficient serialization + Sane schema migration + Separate logical from physical Use Thrift. or Protobufs.or Avro.
  • 27. Schemas aren’t enough! We need a data discovery service! Where’s the data? How do I read it? Who produces it? Who consumes it? When was it last generated? …
  • 28. Where to find data? Old way: Hard-coded partitioning scheme, path, format A = LOAD ‘/tables/statuses/2011/01/{05,06,07}/*.lzo’ USING LzoStatusProtobufBlockPigLoader(); Custom loader How do people know? 1.) Ask around 2.) Cargo-cult New way: Nice UI for browsing A = LOAD ‘tables.statuses’ USING TwadoopLoader(); Same loader each time B = FILTER A BY year == ‘2011’ AND month == ‘11’ AND day == ‘01’ AND hour >= ’05’ AND hour <= ‘07’; Filters are pushed into the loader. No need to understand partitioning scheme.
  • 29. Data Access Layer (DAL) ―All problems in computer science can be solved by another level of indirection... Except for the problem of too many layers of indirection.‖ – David Wheeler All data accesses go through DAL Thin layer on top of HCatalog
  • 30. Data Access Layer (DAL) Who wrote what data, when? #win Automatically construct data/job dependency graph Automatically figure out ownership Hooks into alerting, auditing, repos, deploy systems, etc.
  • 31. Plumbing Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012. Source: Wikipedia (Plumbing)
  • 33. label Given feature vector ? features features features features learner classifier features features Induce ? ? Training Predicting
  • 34. Stone age machine learning… Source: Wikipedia (Stonehenge)
  • 35. upload results data munging Joining multiple dataset Feature extraction … download down-sample test data train predict
  • 36. What doesn’t work… 1. Down-sampling for training on single-processor  Defeats the whole point of big data! 2. Ad hoc productionizing  Disconnected from rest of production workflow
  • 37. Production considerations: dependency management scheduling resource allocation monitoring error reporting alerting … We need…
  • 39. Integration with production workflows Source: Wikipedia (Oil refinery)
  • 40. Stochastic Gradient Descent Conceptually, classifier training is a like user-defined aggregate function! AVG SGD initialize sum = 0 initialize count = 0 weights update add to sum increment count terminate return sum / count return weights
  • 41. previous Pig dataflow previous Pig dataflow map Classifier Training reduce label, feature vector Pig storage function model model model feature vector feature vector Making model UDF model UDF Predictions prediction prediction Just like any other parallel Pig dataflow
  • 42. It’s just Pig! For ―free‖: dependency management, scheduling, resource allocation, monitoring, error reporting, alerting, …
  • 44. Takeaway messages How do we make data scientists’ lives a bit easier? Adding a bit of structure goes a long way Getting the plumbing right makes all the difference ―In theory, there is no difference between theory and practice. But, in practice, there is.‖ Questions? - Jan L.A. van de Snepscheut