Seattle Monthly Hadoop / Scalability / NoSQL Meetup

Ted Dunning, MapR
Agenda
•   Lightning talks / community announcements
•   Main Speaker
•   Bier @ Feierabend - 422 Yale Ave North
•   Hashtags #Seattle #Hadoop
Fast & Frugal: Running a Lean Startup
   with AWS – Oct 27th 10am-2pm
http://aws.amazon.com/about-aws/events/
Seattle AWS User Group November 9th, 2011 – 6:30-9pm
• In November we're going to hear from Amy Woodward from EngineYard about keeping your systems live through outages and other problems using EngineYard atop AWS. Come check out this great talk and learn a thing or three about EngineYard and keeping high availability for your systems!
• http://www.nwcloud.org
www.mapr.com
• MapR is an amazing new distributed filesystem modeled after Hadoop's HDFS. It maintains API compatibility with Hadoop, but far exceeds it in performance, manageability, and more.
MapR, Scaling, Machine Learning
Outline
• Philosophy
• Architecture
• Applications
Physics of startup companies
For startups
•   History is always small
•   The future is huge
•   Must adopt new technology to survive
•   Compatibility is not as important
    – In fact, incompatibility is assumed
Physics of large companies

[Growth-curve diagram: after the startup phase, relative growth flattens but absolute growth is still very large]
For large businesses
•   Present state is always large
•   Relative growth is much smaller
•   Absolute growth rate can be very large
•   Must adopt new technology to survive
    – Cautiously!
    – But must integrate technology with legacy
• Compatibility is crucial
The startup technology picture

[Diagram: old computers and software are dwarfed by current computers and software and by expected hardware and software growth; there is no compatibility requirement with the old systems]
The large enterprise picture

[Diagram: current hardware and software, a proof-of-concept Hadoop cluster, and the long-term Hadoop cluster must all work together; how the legacy systems fit in is an open question]
What does this mean?
• Hadoop is very, very good at streaming through things in batch jobs
• HBase is good at persisting data in very write-heavy workloads
• Unfortunately, the foundation of both systems is HDFS, which does not export or import well
Narrow Foundations

[Diagram: web services, OLAP, and OLTP sit on RDBMS and NAS; Pig, Hive, map/reduce, sequential file processing, and HBase sit on HDFS. Big data is heavy and expensive to move between the two foundations.]
Narrow Foundations
• Because big data has inertia, it is difficult to
  move
   – It costs time to move
   – It costs reliability because of more moving parts
• The result is many duplicate copies
One Possible Answer
• Widen the foundation
• Use standard communication protocols
• Allow conventional processing to share with
  parallel processing
Broad Foundation

[Diagram: the same stack as before, but MapR underlies RDBMS, NAS, and HDFS alike, giving web services, OLAP, OLTP, sequential file processing, map/reduce, Pig, Hive, and HBase one shared foundation]
Broad Foundation
• Having a broad foundation allows many kinds
  of computation to work together
• It is no longer necessary to throw data over a
  wall
• Performance much higher for map-reduce
• Enterprise grade feature sets such as
  snapshots and mirrors can be integrated
• Operations more familiar to admin staff
Map-Reduce

[Diagram: input → map function → shuffle → reduce function → output]
Map-reduce key details
• User supplies f1 (map) and f2 (reduce)
  – Both are pure functions with no side effects
• Framework supplies input, shuffle, output
• Framework will re-run f1 and f2 on failure
• Redundant task completion is OK
Map-Reduce

[Diagram: several map tasks (f1) spill to local disk; the shuffle feeds the reduce tasks (f2), which write the output]
Example – WordCount
• Mapper
  – read line, tokenize into words
  – emit (word, 1)
• Reducer
  – read (word, [k1, … , kn])
  – emit (word, Σki)
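In single-machine Python terms, the whole WordCount job, including a simulated shuffle, looks roughly like this (a minimal sketch, not actual Hadoop API code):

```python
from collections import defaultdict

def wc_map(line):
    # tokenize a line and emit (word, 1) pairs
    for word in line.split():
        yield (word.lower(), 1)

def wc_reduce(word, counts):
    # sum the partial counts for one word
    return (word, sum(counts))

def run_wordcount(lines):
    # simulate the framework's shuffle: group emitted values by key
    groups = defaultdict(list)
    for line in lines:
        for word, n in wc_map(line):
            groups[word].append(n)
    return dict(wc_reduce(w, ns) for w, ns in groups.items())
```

For example, `run_wordcount(["a b a"])` returns `{'a': 2, 'b': 1}`. Because the map and reduce functions are pure, the framework is free to re-run them on failure.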
Example – Map Tiles
• Input is set of objects
  – Roads (polyline)
  – Towns (polygon)
  – Lakes (polygon)
• Output is set of map-tiles
  – Graphic image of part of map
Bottlenecks and Issues
• Read-only files
• Many copies in I/O path
• Shuffle based on HTTP
   – Can’t use new technologies
   – Eats file descriptors
• Spills go to local file space
   – Bad for skewed distribution of sizes
MapR Areas of Development

[Diagram: MapR development spans the ecosystem, HBase, map-reduce, storage services, and management]
MapR Improvements
• Faster file system
  – Fewer copies
  – Multiple NICs
  – No file descriptor or page-buf competition
• Faster map-reduce
  – Uses distributed file system
  – Direct RPC to receiver
  – Very wide merges
MapR Innovations
• Volumes
  – Distributed management
  – Data placement
• Read/write random access file system
  – Allows distributed meta-data
  – Improved scaling
  – Enables NFS access
• Application-level NIC bonding
• Transactionally correct snapshots and mirrors
MapR's Containers

Files and directories are sharded into blocks, which are placed into mini NNs (containers) on disks.

• Each container contains directories & files, and data blocks
• Containers are replicated on servers
• Containers are 16-32 GB segments of disk, placed on nodes
• No need to manage them directly
MapR's Containers

• Each container has a replication chain
• Updates are transactional
• Failures are handled by rearranging replication
Container locations and replication

[Diagram: containers are replicated across nodes N1, N2, and N3; the container location database (CLDB) keeps track of the nodes hosting each container and the replication chain order]
MapR Scaling
• Containers represent 16-32 GB of data
  – Each can hold up to 1 billion files and directories
  – 100M containers = ~2 exabytes (a very large cluster)
• 250 bytes of DRAM to cache a container
  – 25 GB to cache all containers for a 2 EB cluster
  – But not necessary; can page to disk
  – A typical large 10 PB cluster needs 2 GB
• Container reports are 100x-1000x smaller than HDFS block reports
  – Serve 100x more data nodes
  – Increase container size to 64 GB to serve a 4 EB cluster
  – Map/reduce not affected
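These figures check out on the back of an envelope (the 20 GB average container size below is my assumption within the stated 16-32 GB range):

```python
# Back-of-envelope check of the scaling numbers above (my arithmetic, not MapR code)
GB = 10**9
container_bytes = 20 * GB          # containers are 16-32 GB; use a 20 GB average
num_containers = 100_000_000
cldb_bytes_per_container = 250     # DRAM needed to cache one container's location

total_data = num_containers * container_bytes           # ~2 EB = 2e18 bytes
cldb_cache = num_containers * cldb_bytes_per_container  # 25 GB

print(total_data / 10**18, "EB")   # 2.0 EB
print(cldb_cache / GB, "GB")       # 25.0 GB
```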
MapR's Streaming Performance

[Bar charts: streaming MB per second for read and write, comparing raw hardware, MapR, and Hadoop on 11 x 7200 rpm SATA and 11 x 15K rpm SAS disks; higher is better. Tests: i. 16 streams x 120 GB; ii. 2000 streams x 1 GB]
Terasort on MapR

[Bar charts: elapsed time in minutes for 1.0 TB and 3.5 TB sorts, MapR vs Hadoop, on 10+1 nodes (8 core, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm); lower is better]
HBase on MapR

[Bar chart: YCSB random read throughput (records per second) with 1 billion 1K records on a 10+1 node cluster (8 core, 24 GB DRAM, 11 x 1 TB 7200 RPM), for Zipfian and uniform request distributions, MapR vs Apache; higher is better]
Small Files (Apache Hadoop, 10 nodes)

[Chart: file-create rate (files/sec) vs number of files (millions), out-of-box vs tuned. Op: create file, write 100 bytes, close. Notes: NN not replicated; NN uses 20 GB DRAM; DN uses 2 GB DRAM]
MUCH faster for some operations

[Chart: create rate vs number of files (millions) on the same 10 nodes]
What MapR is not
• Volumes != federation
  – MapR supports > 10,000 volumes all with
    independent placement and defaults
  – Volumes support snapshots and mirroring
• NFS != FUSE
  – Checksum and compress at gateway
  – IP fail-over
  – Read/write/update semantics at full speed
• MapR != maprfs
Not Your Father’s NFS
• Multiple architectures possible
• Export to the world
  – NFS gateway runs on selected gateway hosts
• Local server
  – NFS gateway runs on local host
  – Enables local compression and checksumming
• Export to self
  – NFS gateway runs on all data nodes, mounted
    from localhost
Export to the world

[Diagram: an NFS client reaches the cluster through one of several NFS server gateways]
Local server

[Diagram: the application and NFS server run together on the client host, which talks to the cluster nodes]
Universal export to self

[Diagram: every cluster node runs an NFS server mounted from localhost, and the tasks on each node use it; all nodes are identical]
Application architecture
• High performance map-reduce is nice



• But algorithmic flexibility is even nicer
Sharded text indexing

[Diagram: input documents → map (assign documents to shards) → reducer (index text to local disk, then copy the index to the clustered index store); a copy back to local disk is typically required before the search engine can load the index]
Sharded text indexing
• Mapper assigns document to shard
  – Shard is usually hash of document id
• Reducer indexes all documents for a shard
  – Indexes created on local disk
  – On success, copy index to DFS
  – On failure, delete local files
• Must avoid directory collisions
  – can’t use shard id!
• Must manage and reclaim local disk space
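A sketch of the shard assignment and collision-avoiding directory naming (the shard count, path layout, and attempt-id scheme are illustrative assumptions, not from the slides):

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count for illustration

def shard_for(doc_id: str) -> int:
    # shard is a stable hash of the document id, modulo the shard count
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def attempt_dir(shard: int, attempt_id: str) -> str:
    # a per-attempt directory avoids collisions when a failed reduce
    # attempt for the same shard is re-run; on success the index is
    # copied to the DFS, on failure the whole directory is deleted
    return f"index-tmp/shard-{shard:04d}/{attempt_id}"
```

Naming the work directory by shard id alone would collide when the framework re-runs a failed (or speculative) reduce attempt, which is why the attempt id must be part of the path.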
Conventional data flow

[Diagram: the same indexing flow, annotated with its failure modes; failure of a reducer leaves garbage on local disk, and failure of the search engine requires another download of the index from clustered storage]
Simplified NFS data flows

[Diagram: input documents → map → reducer → clustered index storage; the search engine reads the mirrored index directly, and a failed reducer is cleaned up by the map-reduce framework]
Simplified NFS data flows

[Diagram: mirroring allows exact placement of index data, so several search engines read directly from mirrors; arbitrary levels of replication are also possible]
How about another one?
K-means
• Classic E-M based algorithm
• Given cluster centroids,
  – Assign each data point to nearest centroid
  – Accumulate new centroids
  – Rinse, lather, repeat
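One E-M iteration of the algorithm above can be sketched in a few lines of Python (a minimal single-machine sketch, not MapR or Mahout code):

```python
def nearest(point, centroids):
    # index of the centroid closest to the point (squared Euclidean distance)
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans_step(points, centroids):
    # E-step: assign each point to its nearest centroid
    # M-step: each new centroid is the mean of its assigned points
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        i = nearest(p, centroids)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += p[d]
    return [[s / counts[i] for s in sums[i]] if counts[i] else list(centroids[i])
            for i in range(len(centroids))]
```

"Rinse, lather, repeat" is then just calling `kmeans_step` until the centroids stop moving.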
K-means, the movie

[Diagram: input → assign each point to the nearest centroid → aggregate new centroids → feed the centroids back for the next iteration]
But …
Parallel Stochastic Gradient Descent

[Diagram: input → train sub-models in parallel → average the models → final model]
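A minimal single-machine sketch of this train-then-average scheme (the linear model, learning rate, and data layout are illustrative assumptions):

```python
def sgd_pass(weights, data, lr=0.05):
    # one SGD pass over (x, y) pairs for a linear model with squared loss
    w = list(weights)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def average_models(models):
    # element-wise average of the sub-models' weight vectors
    return [sum(ws) / len(models) for ws in zip(*models)]

def parallel_sgd(partitions, init):
    # "map": train an independent sub-model on each data partition
    # "reduce": average the sub-models into one
    return average_models([sgd_pass(init, part) for part in partitions])
```

In the real pipeline each partition's training runs in a separate mapper and only the small weight vectors travel to the averaging step.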
Variational Dirichlet Assignment

[Diagram: input → gather sufficient statistics → update the model → feed back for the next iteration]
Old tricks, new dogs
• Mapper
  – Assign point to cluster (centroids read from local disk via the distributed cache, which copies them from HDFS)
  – Emit cluster id, (1, point)
• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, (n, sum/n)
• Output to HDFS (written by map-reduce)
Old tricks, new dogs
• Mapper
  – Assign point to cluster (centroids read directly from NFS)
  – Emit cluster id, (1, point)
• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, (n, sum/n)
• Output to HDFS (MapR FS, written by map-reduce)
Poor man’s Pregel
• Mapper
    while not done:
      read and accumulate input models
      for each input:
        accumulate model
      write model
      synchronize
      reset input format
    emit summary

• The model read, write, and synchronize steps can use conventional I/O via NFS



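A single-process sketch of that mapper loop (reading JSON model files from a shared directory stands in for the NFS reads; the file layout and the averaging combiner are illustrative assumptions):

```python
import glob
import json
import os

def pregel_like(indir, outpath, rounds=3):
    # Sketch of the mapper loop: read the partial models, accumulate
    # them (here: average), write the combined model, and repeat.
    model = 0.0
    for _ in range(rounds):
        inputs = []
        for path in glob.glob(os.path.join(indir, "*.json")):  # read input models
            with open(path) as f:
                inputs.append(json.load(f))
        if inputs:
            model = sum(inputs) / len(inputs)                  # accumulate model
        with open(outpath, "w") as f:                          # write model
            json.dump(model, f)
        # a real implementation would synchronize with the other tasks
        # and reset its input format here before the next round
    return model
```

In the real scheme, each mapper runs this loop and sees the other tasks' model files through the NFS mount, so no special framework support is needed for the iteration.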
Click modeling architecture

[Diagram: input → feature extraction and down-sampling → data join → sequential SGD learning; the side-data now arrives via NFS, and everything up to the join runs as map-reduce]
Click modeling architecture

[Diagram: the same pipeline, but map-reduce cooperates with NFS to feed several sequential SGD learners in parallel; both the front-end processing and the learners run under map-reduce]
And another…
Hybrid model flow

[Diagram: feature extraction and down-sampling (map-reduce) feeds an SVD / PageRank / spectral stage (map-reduce) and downstream modeling to produce a deployed model; the hand-off between the stages is the awkward part]
Hybrid model flow

[Diagram: the same flow, with the SVD / PageRank / spectral stage running as sequential code alongside map-reduce, removing the awkward hand-off]
And visualization…
Trivial visualization interface
• Map-reduce output is visible via NFS
  $ R
  > x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
  > plot(error ~ t, x)
  > q(save='n')



• Legacy visualization just works
Conclusions
• We used to know all this
• Tab completion used to work
• 5 years of work-arounds have clouded our
  memories

• We just have to remember the future

Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
lucenerevolution
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
Jason Shao
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
saili mane
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
cwensel
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
Federico Cargnelutti
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
Joe Alex
 
Introduction to hadoop V2
Introduction to hadoop V2Introduction to hadoop V2
Introduction to hadoop V2
TarjeiRomtveit
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
Richard McDougall
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloud
elliando dias
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
Cloudera, Inc.
 
Quantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFSQuantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFS
bigdatagurus_meetup
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
Derek Chen
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
clairvoyantllc
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
Ovidiu Dimulescu
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
VMware Tanzu
 

Similar to Seattle Scalability Meetup - Ted Dunning - MapR (20)

Ted Dunning - Whither Hadoop
Ted Dunning - Whither HadoopTed Dunning - Whither Hadoop
Ted Dunning - Whither Hadoop
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
TriHUG - Beyond Batch
TriHUG - Beyond BatchTriHUG - Beyond Batch
TriHUG - Beyond Batch
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Introduction to hadoop V2
Introduction to hadoop V2Introduction to hadoop V2
Introduction to hadoop V2
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloud
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Quantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFSQuantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFS
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 

More from clive boulton

Camlistore reprise at Google NYC
Camlistore reprise at Google NYCCamlistore reprise at Google NYC
Camlistore reprise at Google NYC
clive boulton
 
Riak TS
Riak TSRiak TS
Riak TS
clive boulton
 
Ignitepii2014
Ignitepii2014Ignitepii2014
Ignitepii2014
clive boulton
 
Personal databank
Personal databankPersonal databank
Personal databank
clive boulton
 
Seattle Scalability meetup intro slides, Jan 22, 2014
Seattle Scalability meetup intro slides, Jan 22, 2014Seattle Scalability meetup intro slides, Jan 22, 2014
Seattle Scalability meetup intro slides, Jan 22, 2014
clive boulton
 
Seattle Scalability meetup intro slides - Dec 4, 2013 - Scaling SQL + Scaling...
Seattle Scalability meetup intro slides - Dec 4, 2013 - Scaling SQL + Scaling...Seattle Scalability meetup intro slides - Dec 4, 2013 - Scaling SQL + Scaling...
Seattle Scalability meetup intro slides - Dec 4, 2013 - Scaling SQL + Scaling...
clive boulton
 
Seattle scalability meetup intro slides 23 oct 2013
Seattle scalability meetup intro slides 23 oct 2013Seattle scalability meetup intro slides 23 oct 2013
Seattle scalability meetup intro slides 23 oct 2013
clive boulton
 
Seattle scalability meetup intro slides 24 july 2013
Seattle scalability meetup intro slides 24 july 2013Seattle scalability meetup intro slides 24 july 2013
Seattle scalability meetup intro slides 24 july 2013
clive boulton
 
Seattle Scalability Meetup intro pptx - June 26
Seattle Scalability Meetup intro pptx - June 26Seattle Scalability Meetup intro pptx - June 26
Seattle Scalability Meetup intro pptx - June 26
clive boulton
 
Seattle scalability meetup intro ppt May 22
Seattle scalability meetup intro ppt May 22Seattle scalability meetup intro ppt May 22
Seattle scalability meetup intro ppt May 22
clive boulton
 
Patent Trollls gonna kill VRM?
Patent Trollls gonna kill VRM?Patent Trollls gonna kill VRM?
Patent Trollls gonna kill VRM?
clive boulton
 
Seattle scalability meetup March 27,2013 intro slides
Seattle scalability meetup March 27,2013 intro slidesSeattle scalability meetup March 27,2013 intro slides
Seattle scalability meetup March 27,2013 intro slides
clive boulton
 
Seattle scalability meetup intro
Seattle scalability meetup introSeattle scalability meetup intro
Seattle scalability meetup intro
clive boulton
 
Seattle Scalability Meetup | Accumulo and WhitePages
Seattle Scalability Meetup | Accumulo and WhitePagesSeattle Scalability Meetup | Accumulo and WhitePages
Seattle Scalability Meetup | Accumulo and WhitePages
clive boulton
 
Seattle Scalability - Sept Meetup
Seattle Scalability - Sept MeetupSeattle Scalability - Sept Meetup
Seattle Scalability - Sept Meetup
clive boulton
 
Seattle montly hadoop nosql scalability meetup
Seattle montly hadoop nosql scalability meetupSeattle montly hadoop nosql scalability meetup
Seattle montly hadoop nosql scalability meetup
clive boulton
 
Leapfrogging with legacy
Leapfrogging with legacyLeapfrogging with legacy
Leapfrogging with legacy
clive boulton
 
Whole Chain Traceability, pulling a Kobayashi Maru.
Whole Chain Traceability, pulling a Kobayashi Maru. Whole Chain Traceability, pulling a Kobayashi Maru.
Whole Chain Traceability, pulling a Kobayashi Maru.
clive boulton
 
Whole Chain Traceability Consortium
Whole Chain Traceability ConsortiumWhole Chain Traceability Consortium
Whole Chain Traceability Consortium
clive boulton
 
Seattle Scalability - GigaSpaces / Cassandra
Seattle Scalability - GigaSpaces / CassandraSeattle Scalability - GigaSpaces / Cassandra
Seattle Scalability - GigaSpaces / Cassandra
clive boulton
 

More from clive boulton (20)

Camlistore reprise at Google NYC
Camlistore reprise at Google NYCCamlistore reprise at Google NYC
Camlistore reprise at Google NYC
 
Riak TS
Riak TSRiak TS
Riak TS
 
Ignitepii2014
Ignitepii2014Ignitepii2014
Ignitepii2014
 
Personal databank
Personal databankPersonal databank
Personal databank
 
Seattle Scalability meetup intro slides, Jan 22, 2014
Seattle Scalability meetup intro slides, Jan 22, 2014Seattle Scalability meetup intro slides, Jan 22, 2014
Seattle Scalability meetup intro slides, Jan 22, 2014
 
Seattle Scalability meetup intro slides - Dec 4, 2013 - Scaling SQL + Scaling...
Seattle Scalability meetup intro slides - Dec 4, 2013 - Scaling SQL + Scaling...Seattle Scalability meetup intro slides - Dec 4, 2013 - Scaling SQL + Scaling...
Seattle Scalability meetup intro slides - Dec 4, 2013 - Scaling SQL + Scaling...
 
Seattle scalability meetup intro slides 23 oct 2013
Seattle scalability meetup intro slides 23 oct 2013Seattle scalability meetup intro slides 23 oct 2013
Seattle scalability meetup intro slides 23 oct 2013
 
Seattle scalability meetup intro slides 24 july 2013
Seattle scalability meetup intro slides 24 july 2013Seattle scalability meetup intro slides 24 july 2013
Seattle scalability meetup intro slides 24 july 2013
 
Seattle Scalability Meetup intro pptx - June 26
Seattle Scalability Meetup intro pptx - June 26Seattle Scalability Meetup intro pptx - June 26
Seattle Scalability Meetup intro pptx - June 26
 
Seattle scalability meetup intro ppt May 22
Seattle scalability meetup intro ppt May 22Seattle scalability meetup intro ppt May 22
Seattle scalability meetup intro ppt May 22
 
Patent Trollls gonna kill VRM?
Patent Trollls gonna kill VRM?Patent Trollls gonna kill VRM?
Patent Trollls gonna kill VRM?
 
Seattle scalability meetup March 27,2013 intro slides
Seattle scalability meetup March 27,2013 intro slidesSeattle scalability meetup March 27,2013 intro slides
Seattle scalability meetup March 27,2013 intro slides
 
Seattle scalability meetup intro
Seattle scalability meetup introSeattle scalability meetup intro
Seattle scalability meetup intro
 
Seattle Scalability Meetup | Accumulo and WhitePages
Seattle Scalability Meetup | Accumulo and WhitePagesSeattle Scalability Meetup | Accumulo and WhitePages
Seattle Scalability Meetup | Accumulo and WhitePages
 
Seattle Scalability - Sept Meetup
Seattle Scalability - Sept MeetupSeattle Scalability - Sept Meetup
Seattle Scalability - Sept Meetup
 
Seattle montly hadoop nosql scalability meetup
Seattle montly hadoop nosql scalability meetupSeattle montly hadoop nosql scalability meetup
Seattle montly hadoop nosql scalability meetup
 
Leapfrogging with legacy
Leapfrogging with legacyLeapfrogging with legacy
Leapfrogging with legacy
 
Whole Chain Traceability, pulling a Kobayashi Maru.
Whole Chain Traceability, pulling a Kobayashi Maru. Whole Chain Traceability, pulling a Kobayashi Maru.
Whole Chain Traceability, pulling a Kobayashi Maru.
 
Whole Chain Traceability Consortium
Whole Chain Traceability ConsortiumWhole Chain Traceability Consortium
Whole Chain Traceability Consortium
 
Seattle Scalability - GigaSpaces / Cassandra
Seattle Scalability - GigaSpaces / CassandraSeattle Scalability - GigaSpaces / Cassandra
Seattle Scalability - GigaSpaces / Cassandra
 

Recently uploaded

How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
LucaBarbaro3
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 

Recently uploaded (20)

How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 

Seattle Scalability Meetup - Ted Dunning - MapR

  • 1. Seattle Monthly Hadoop / Scalability / NoSQL Meetup. Ted Dunning, MapR
  • 2. Agenda • Lightning talks / community announcements • Main Speaker • Bier @ Feierabend - 422 Yale Ave North • Hashtags #Seattle #Hadoop
  • 3. Fast & Frugal: Running a Lean Startup with AWS – Oct 27th 10am-2pm http://aws.amazon.com/about-aws/events/
  • 4. Seattle AWS User Group November 9th, 2011 – 6:30-9pm • In November we're going to hear from Amy Woodward from EngineYard about keeping your systems live through outages and other problems using EngineYard atop AWS. Come check out this great talk and learn a thing or three about EngineYard and keeping high availability for your systems! • http://www.nwcloud.org
  • 5. www.mapr.com • MapR is an amazing new distributed filesystem modeled after Hadoop. It maintains API compatibility with Hadoop, but far exceeds it in performance, manageability, and more.
  • 8. Physics of startup companies
  • 9. For startups • History is always small • The future is huge • Must adopt new technology to survive • Compatibility is not as important – In fact, incompatibility is assumed
  • 10. Physics of large companies (diagram: growth curve with absolute growth still very large after the startup phase)
  • 11. For large businesses • Present state is always large • Relative growth is much smaller • Absolute growth rate can be very large • Must adopt new technology to survive – Cautiously! – But must integrate technology with legacy • Compatibility is crucial
  • 12. The startup technology picture (diagram: old computers and software give way to current computers and software, with large expected hardware and software growth ahead; no compatibility requirement)
  • 13. The large enterprise picture (diagram: current hardware and software, a proof-of-concept Hadoop cluster, and a long-term Hadoop cluster must work together)
  • 14. What does this mean? • Hadoop is very, very good at streaming through things in batch jobs • HBase is good at persisting data in very write-heavy workloads • Unfortunately, the foundation of both systems is HDFS, which does not export or import well
  • 15. Narrow Foundations (diagram: Pig, Hive, web services, sequential file processing, map/reduce, OLAP, OLTP, and HBase each sit atop a separate store: RDBMS, NAS, or HDFS. Big data is heavy and expensive to move.)
  • 16. Narrow Foundations • Because big data has inertia, it is difficult to move – It costs time to move – It costs reliability because of more moving parts • The result is many duplicate copies
  • 17. One Possible Answer • Widen the foundation • Use standard communication protocols • Allow conventional processing to share with parallel processing
  • 18. Broad Foundation (diagram: the same stack of Pig, Hive, web services, sequential file processing, map/reduce, OLAP, OLTP, and HBase, with MapR underneath RDBMS, NAS, and HDFS as a common foundation)
  • 19. Broad Foundation • Having a broad foundation allows many kinds of computation to work together • It is no longer necessary to throw data over a wall • Performance much higher for map-reduce • Enterprise grade feature sets such as snapshots and mirrors can be integrated • Operations more familiar to admin staff
  • 20. Map-Reduce (diagram: input flows through the map function, the shuffle, and the reduce function to the output)
  • 21. Map-reduce key details • User supplies f1 (map) and f2 (reduce) – Both are pure functions, no side effect • Framework supplies input, shuffle, output • Framework will re-run f1 and f2 on failure • Redundant task completion is OK
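Since the map and reduce functions are pure, failure handling reduces to re-execution; a minimal sketch of that idea in Python (hypothetical framework names, not Hadoop's API):

```python
def run_with_retries(task, data, attempts=3):
    # A pure task has no side effects, so re-running it after a
    # failure is safe: the re-run produces identical output, and a
    # redundant (speculative) completion is equally harmless.
    last_error = None
    for _ in range(attempts):
        try:
            return task(data)
        except RuntimeError as e:
            last_error = e
    raise last_error

# A flaky map function that fails on its first invocation only.
calls = {"n": 0}
def flaky_square(x):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated task failure")
    return x * x

print(run_with_retries(flaky_square, 7))  # → 49
```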
  • 23. Map-Reduce (diagram: input is processed by map tasks f1, spilled to local disk, then consumed by reduce tasks f2 to produce output)
  • 24. Example – WordCount • Mapper – read line, tokenize into words – emit (word, 1) • Reducer – read (word, [k1, … , kn]) – Emit (word, Σki)
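The WordCount pattern on the slide can be sketched end to end in Python; this is a toy in-memory version in which the shuffle is simulated with a dict, not Hadoop's actual API:

```python
from collections import defaultdict

def mapper(line):
    # Tokenize the line and emit (word, 1) for each word.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Sum the partial counts for one word.
    return (word, sum(counts))

def word_count(lines):
    # Shuffle: group mapper output by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    return dict(reducer(w, c) for w, c in groups.items())

print(word_count(["the quick fox", "the lazy dog"]))
# → {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```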
  • 25. Example – Map Tiles • Input is set of objects – Roads (polyline) – Towns (polygon) – Lakes (polygon) • Output is set of map-tiles – Graphic image of part of map
  • 26. Bottlenecks and Issues • Read-only files • Many copies in I/O path • Shuffle based on HTTP – Can’t use new technologies – Eats file descriptors • Spills go to local file space – Bad for skewed distribution of sizes
  • 27. MapR Areas of Development (diagram: HBase, Map/Reduce, Ecosystem, Storage, Management Services)
  • 28. MapR Improvements • Faster file system – Fewer copies – Multiple NICs – No file descriptor or page-buf competition • Faster map-reduce – Uses distributed file system – Direct RPC to receiver – Very wide merges
  • 29. MapR Innovations • Volumes – Distributed management – Data placement • Read/write random access file system – Allows distributed meta-data – Improved scaling – Enables NFS access • Application-level NIC bonding • Transactionally correct snapshots and mirrors
  • 30. MapR's Containers • Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks • Each container contains directories & files and data blocks, replicated on servers • Containers are 16-32 GB segments of disk, placed on nodes • No need to manage them directly
  • 31. MapR's Containers • Each container has a replication chain • Updates are transactional • Failures are handled by rearranging replication
  • 32. Container locations and replication (diagram: nodes N1, N2, N3 each host several containers; the container location database (CLDB) keeps track of the nodes hosting each container and the replication chain order)
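The CLDB's bookkeeping can be sketched as a simple mapping from container id to its ordered replication chain; this is a toy model using the node names from the slide, not MapR's implementation:

```python
# Toy model of the CLDB: container id -> ordered replication chain.
cldb = {
    1: ["N1", "N2"],
    2: ["N3", "N2"],
    3: ["N1", "N3"],
}

def locate(container_id):
    # The first node heads the replication chain; the CLDB only
    # hands out locations, it does not sit in the data path.
    return cldb[container_id]

def handle_node_failure(dead_node):
    # A failure is handled by rearranging replication: drop the dead
    # node from every chain (re-replication would later restore the
    # desired replication factor).
    for chain in cldb.values():
        if dead_node in chain:
            chain.remove(dead_node)

print(locate(2))  # → ['N3', 'N2']
```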
  • 33. MapR Scaling • Containers represent 16-32 GB of data • Each can hold up to 1 billion files and directories • 100M containers = ~2 exabytes (a very large cluster) • 250 bytes of DRAM to cache a container: 25 GB caches all containers for a 2 EB cluster, but this is not necessary since entries can page to disk; a typical large 10 PB cluster needs 2 GB • Container reports are 100x-1000x smaller than HDFS block reports, so serve 100x more data nodes • Increase container size to 64 GB to serve a 4 EB cluster • Map/reduce not affected
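The capacity and cache numbers on the slide can be sanity-checked with quick arithmetic (decimal units; the average container size is an assumption within the slide's 16-32 GB range):

```python
GB = 10**9
containers = 100_000_000             # 100M containers
avg_container = 20 * GB              # assume ~20 GB average per container
total_data = containers * avg_container
print(total_data / 10**18)           # → 2.0 (exabytes, matching the slide)

dram_per_container = 250             # bytes of DRAM to cache one entry
cache = containers * dram_per_container
print(cache / GB)                    # → 25.0 (GB to cache all containers)
```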
  • 34. MapR's Streaming Performance (charts: read and write throughput in MB per second, comparing hardware limit, MapR, and Hadoop on 11 x 7200 rpm SATA and 11 x 15K rpm SAS disks; higher is better. Tests: (i) 16 streams x 120 GB, (ii) 2000 streams x 1 GB)
  • 35. Terasort on MapR (charts: elapsed time in minutes for 1.0 TB and 3.5 TB sorts, MapR vs Hadoop; 10+1 nodes: 8 core, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm; lower is better)
  • 36. HBase on MapR (chart: YCSB random read throughput in records per second with 1 billion 1 KB records, MapR vs Apache, under Zipfian and uniform key distributions; 10+1 node cluster: 8 core, 24 GB DRAM, 11 x 1 TB 7200 RPM; higher is better)
  • 37. Small Files (Apache Hadoop, 10 nodes) (chart: rate in files per second versus number of files in millions, out of box and tuned. Op: create file, write 100 bytes, close. Notes: NN not replicated, NN uses 20 GB DRAM, DN uses 2 GB DRAM)
  • 38. MUCH faster for some operations (chart: create rate versus number of files in millions on the same 10 nodes)
  • 39. What MapR is not • Volumes != federation – MapR supports > 10,000 volumes all with independent placement and defaults – Volumes support snapshots and mirroring • NFS != FUSE – Checksum and compress at gateway – IP fail-over – Read/write/update semantics at full speed • MapR != maprfs
• 40. Not Your Father's NFS • Multiple architectures possible • Export to the world – NFS gateway runs on selected gateway hosts • Local server – NFS gateway runs on the local host – Enables local compression and checksumming • Export to self – NFS gateway runs on all data nodes, mounted from localhost
• 41. Export to the world: [Diagram: an NFS client mounts the cluster through multiple NFS servers running on gateway hosts.]
• 42. Local server: [Diagram: the application and the NFS server/client run on the local host, which talks to the cluster nodes.]
• 43. Universal export to self: [Diagram: a cluster node runs a task alongside a local NFS server, mounted from localhost.]
• 44. Nodes are identical: [Diagram: every cluster node runs the same stack, a task plus a local NFS server.]
  • 45. Application architecture • High performance map-reduce is nice • But algorithmic flexibility is even nicer
• 46. Sharded text indexing: [Diagram: input documents → map (assign documents to shards) → reducers build indexes on local disk → index copied to clustered index storage → search engine copies the index to local disk, typically required before the index can be loaded.]
• 47. Sharded text indexing • Mapper assigns each document to a shard – The shard is usually a hash of the document id • Reducer indexes all documents for a shard – Indexes are created on local disk – On success, the index is copied to the DFS – On failure, local files are deleted • Must avoid directory collisions – can't use the shard id alone! • Must manage and reclaim local disk space
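A minimal sketch of the shard-assignment step described above: the shard is a hash of the document id, and the slide's directory-collision warning is handled by including a job-unique token in the local path. The paths, shard count, and helper names are all illustrative assumptions, not part of any real indexing framework.

```python
import hashlib
import os

N_SHARDS = 16  # illustrative shard count

def shard_for(doc_id: str) -> int:
    # Shard is a hash of the document id, as on the slide.
    h = hashlib.md5(doc_id.encode()).hexdigest()
    return int(h, 16) % N_SHARDS

def local_index_dir(shard: int, attempt_id: str) -> str:
    # Can't name the directory after the shard id alone: two task
    # attempts for the same shard on one node would collide. Include
    # an attempt-unique token (path is hypothetical).
    return os.path.join("/tmp/index", f"shard-{shard:04d}-{attempt_id}")

print(shard_for("doc-42") == shard_for("doc-42"))  # True: stable assignment
```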
• 48. Conventional data flow: [Diagram: input documents → map → reducer → local disk → clustered index storage → local disk → search engine. Failure of a reducer causes garbage to accumulate on the local disk; failure of the search engine requires another download of the index from clustered storage.]
• 49. Simplified NFS data flows: [Diagram: input documents → map → reducer → clustered index storage, which the search engine reads directly. Failure of a reducer is cleaned up by the map-reduce framework; the search engine reads the mirrored index directly.]
• 50. Simplified NFS data flows: [Diagram: input documents → map → reducer → mirrors → search engines. Mirroring allows exact placement of index data; arbitrary levels of replication are also possible.]
  • 52. K-means • Classic E-M based algorithm • Given cluster centroids, – Assign each data point to nearest centroid – Accumulate new centroids – Rinse, lather, repeat
• 53. K-means, the movie: [Diagram: input → assign each point to the nearest centroid → aggregate into new centroids → loop back with the updated centroids.]
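The assign/aggregate loop above, written out in plain Python (no map-reduce machinery, 1-D points for brevity):

```python
# One E-M style iteration of k-means: assign each point to the nearest
# centroid, then recompute each centroid as the mean of its points.

def kmeans_step(points, centroids):
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for p in points:
        # Assign each data point to the nearest centroid.
        c = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        sums[c] += p
        counts[c] += 1
    # Accumulate new centroids (keep the old one if a cluster is empty).
    return [sums[i] / counts[i] if counts[i] else centroids[i]
            for i in range(len(centroids))]

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centroids = [0.0, 5.0]
for _ in range(5):  # rinse, lather, repeat
    centroids = kmeans_step(points, centroids)
print(centroids)  # roughly [1.0, 9.5]
```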
• 55. Parallel Stochastic Gradient Descent: [Diagram: input → train sub-models in parallel → average into a single model.]
• 56. Variational Dirichlet Assignment: [Diagram: input → gather sufficient statistics → update the model.]
• 57. Old tricks, new dogs • Mapper – Assign point to cluster – Emit cluster id, (1, point) • Combiner and reducer – Sum counts, weighted sum of points – Emit cluster id, (n, sum/n) • Output to HDFS [Annotations: centroids read from local disk via the distributed cache, which copies them from HDFS; output written by map-reduce.]
• 58. Old tricks, new dogs • Mapper – Assign point to cluster – Emit cluster id, (1, point) • Combiner and reducer – Sum counts, weighted sum of points – Emit cluster id, (n, sum/n) • Output to HDFS [Annotations: centroids now read directly from NFS; output written by map-reduce to MapR FS.]
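The combiner/reducer detail on these two slides, emitting (n, sum/n) pairs, works because count-weighted means compose; a sketch of the recombination step:

```python
# Combine partial (count, mean) pairs as a k-means combiner/reducer
# would: the merged mean is the count-weighted average of the partial
# means, so (1, point) pairs can be folded together in any order.

def combine(parts):
    n_total = sum(n for n, _ in parts)
    mean = sum(n * m for n, m in parts) / n_total
    return n_total, mean

# Two combiners saw (3 points, mean 2.0) and (1 point, mean 6.0).
print(combine([(3, 2.0), (1, 6.0)]))  # (4, 3.0)
```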
• 59. Poor man's Pregel • Mapper:
    while not done:
        read and accumulate input models
        for each input:
            accumulate model
        write model
        synchronize
        reset input format
    emit summary
• Lines in bold can use conventional I/O via NFS
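A sketch of the iterative mapper loop above, with the model read and written through ordinary file I/O (the slide's point is that a conventionally mountable file system makes this possible). The paths, model format, and stopping rule are invented; a real multi-mapper job would also synchronize between rounds.

```python
import json
import os
import tempfile

# Toy BSP-style loop: each iteration reads the current model from a
# shared file, accumulates over the local input, and writes it back.

model_path = os.path.join(tempfile.mkdtemp(), "model.json")

with open(model_path, "w") as f:       # initial model
    json.dump({"total": 0.0}, f)

inputs = [1.0, 2.0, 3.0]
done = False
while not done:
    with open(model_path) as f:        # read and accumulate input model
        model = json.load(f)
    for x in inputs:                   # for each input: accumulate
        model["total"] += x
    with open(model_path, "w") as f:   # write model
        json.dump(model, f)
    # synchronize() with the other mappers would go here
    done = model["total"] >= 12.0      # invented stopping rule

print(model["total"])  # 12.0 after two rounds
```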
• 60. Click modeling architecture: [Diagram: input → map-reduce (feature extraction and join, down sampling) → sequential SGD learning, with side-data now supplied via NFS.]
• 61. Click modeling architecture: [Diagram: map-reduce cooperates with NFS; input → map-reduce (feature extraction and join, down sampling) → map-reduce → several sequential SGD learners running in parallel, sharing side-data via NFS.]
• 63. Hybrid model flow: [Diagram: map-reduce feature extraction and down sampling → map-reduce SVD (PageRank, spectral) → ?? → deployed model → downstream modeling.]
• 65. Hybrid model flow: [Diagram: map-reduce feature extraction and down sampling → map-reduce SVD (PageRank, spectral) → sequential stage, filling in the ?? from the earlier slide → deployed model → downstream modeling.]
• 67. Trivial visualization interface • Map-reduce output is visible via NFS:
    $ R
    > x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
    > plot(error ~ t, x)
    > q(save='n')
• Legacy visualization just works
  • 68. Conclusions • We used to know all this • Tab completion used to work • 5 years of work-arounds have clouded our memories • We just have to remember the future

Editor's Notes

1. Constant time implies a constant factor of growth. Thus the accumulation of all history before 10 time units ago is less than half the accumulation in the last 10 units alone. This is true at all times.
  2. Startups use this fact to their advantage and completely change everything to allow time-efficient development initially with conversion to computer-efficient systems later.
  3. Here the later history is shown after the initial exponential growth phase. This changes the economics of the company dramatically.
  4. The startup can throw away history because it is so small. That means that the startup has almost no compatibility requirement because the data lost due to lack of compatibility is a small fraction of the total data.
5. A large enterprise cannot do that. They have to have access to the old data and have to share between old data and Hadoop-accessible data. This doesn't have to happen at the proof-of-concept level, but it really must happen when Hadoop first goes to production.
  6. But stock Hadoop does not handle this well.
  7. This is because Hadoop and other data silos have different foundations. What is worse, there is a semantic wall that separates HDFS from normal resources.
8. Here is a picture that shows how MapR can replace the foundation and provide compatibility. Of course, MapR provides much more than just the base, but the foundation is what provides the fundamental limitation, or lack of limit in MapR's case.