Scalable, Distributed, Machine Learning for Big Data

Big Data, Parallel Computing, Cloud Computing, Lambda architecture, MapReduce, GFS, Hadoop, HDFS, Percolator, Caffeine, Pregel, Drill, Chukwa, Hive, Pig, Scribe, Flume, Thrift, YARN, Storm, Summingbird, S4, ZooKeeper, Data Freeway, Puma 1/2/3, NoSQL, BigTable, Dynamo, Cassandra, HBase, Kafka, Samza, large-scale machine learning, overfitting, curse of dimensionality, load balancing, auto scaling, job scheduling, workflow, Spark, Mahout, Jubatus, GraphLab, BSP, Dremel, Giraph, Hama, Vowpal Wabbit, Trident-ML, Storm-Pattern, Samoa.

Transcript

  • 1. Scalable, Distributed, Machine Learning for Big Data Yu Huang Sunnyvale, California yu.huang07@gmail.com
  • 2. Outline  Big Data - Volume, Variety, Velocity  Parallel Computing and Cloud Computing  Lambda architecture: Batch, Speed and Serving Layers ◦ Hadoop: MR implementation from Yahoo ◦ Apache Thrift: scalable cross-language services from Facebook ◦ Chukwa: data collection system ◦ Apache Flume: stream data collection ◦ Hive: data warehouse ◦ Pig: high-level data-flow language ◦ ZooKeeper: high-performance coordination service ◦ YARN (MRv2 or next-gen Hadoop) ◦ Summingbird: a library for writing MR programs at Twitter ◦ Storm: stream processing from Twitter ◦ S4: stream processing from Yahoo ◦ Scribe: server for aggregating stream data at Facebook ◦ Data Freeway: data stream framework at Facebook ◦ Puma: stream processing from Facebook ◦ Kafka: distributed messaging system at LinkedIn ◦ Samza: stream processing from LinkedIn ◦ Kinesis: real-time stream processing at Amazon
  • 3. Outline  NoSQL - Not Only SQL databases ◦ Google Bigtable ◦ Amazon Dynamo ◦ Cassandra by Facebook ◦ HBase: like Bigtable  NewSQL: ◦ Google Spanner  Large Scale Machine Learning ◦ Spark – Lightning-Fast Cluster Computing ◦ Mahout - Scalable ML on Hadoop ◦ Jubatus – Distributed Online Real-time ML ◦ GraphLab – Big Learning on Graphs from CMU ◦ Vowpal Wabbit – Fast Learning at Yahoo/MS ◦ Trident-ML and Storm-Pattern: ML on Storm, YARN ◦ Upcoming --- Samoa: ML on S4, Storm  Key Issues in Scalable Distributed ML for Big Data ◦ Load balancing ◦ Auto scaling ◦ Job Scheduling ◦ Workflow management  Reference
  • 4. Big Data  Volume (large amounts of data gathered);  Variety (various degrees of structure);  Velocity (how data flow, at high rates);  Value (business);  Variability (changes);  Veracity (quality).  Data as a Service (DaaS) in the cloud;  Two main strategies for dealing with big data: ◦ Sampling; ◦ Distributed systems.  Big Data Challenges ◦ Protecting privacy; ◦ Integration of big data technologies into enterprise landscape; ◦ Addressing increasing real time needs with increasing data volume and varieties; ◦ Leveraging cloud computing with big data storage and processing.
  • 5. Big Data Instances  One billion data instance ◦ Web-scale ◦ Guaranteed to contain data in different formats  ASCII text, pictures, javascript code, PDF documents… ◦ Guaranteed to contain (near) duplicates ◦ Likely to be badly preprocessed  ◦ Storage is an issue  One trillion data instance ◦ Beyond the reach of the modern technology ◦ Peer-to-peer paradigm is (arguably) the only way to process the data ◦ Data privacy / inconsistency / skewness issues  Can’t be kept in one location  Is intrinsically hard to sample
  • 6. Big Data Analysis Pipeline
  • 7. Parallel Computing  Data-Instruction; ◦ SIMD, MIMD, …  Data intensive ◦ Cloud computing  Compute intensive ◦ GPU computing  Shared memory: OpenMP  Distributed memory: MPI,  Hybrid: MR
  • 8. Cloud Computing  A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (Ethernet), usually for large Internet services;  Dynamic provision of services & resource pools in a coordinated fashion;  Cloud computing infrastructure is just a web service interface to operating system virtualization (via hypervisor);  Heterogeneous by virtualization;  Everything as a service (XaaS);  Data intensive: big data;  Distributed parallel, more like utility computing;  Not grid computing.
  • 9. X-as-a-Service
  • 10. Lambda Architecture  Equation “query = function(all data)” which is the basis of all data systems (data is more than information);  Human fault-tolerance – the system is insusceptible to data loss or data corruption  Data immutability – store data in its rawest form, immutable and kept in perpetuity.  Re-computation – with the two principles above it is always possible to (re)compute results  Layered structure: ◦ Batch layer: unrestrained batch compute, horizontally scalable, high latency, read-only database, raw dataset, overridden by the speed layer (like Hadoop); ◦ Speed layer: only new data, stream processing, continuous compute, transactional, limited storage of windowed data (such as Storm); ◦ Serving layer: query batch views by load and random access.  Can discard any view, batch and real time, and just recreate everything from the master data.  Mistakes are corrected via recomputation. ◦ Write bad data? Remove the data & recompute. ◦ Bug in view generation? Just recompute the view.  Data storage is highly optimized.
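  As a toy illustration of “query = function(all data)”, here is a minimal Python sketch (hypothetical page-view counts, not part of the original deck) in which the serving layer answers queries by merging a recomputed batch view with an incrementally maintained speed view:

    def batch_view(master_dataset):
        # Batch layer: recompute the whole view from the immutable master dataset (high latency).
        view = {}
        for event in master_dataset:
            view[event["key"]] = view.get(event["key"], 0) + event["count"]
        return view

    def speed_view(recent_events):
        # Speed layer: maintain a view over only the recent data not yet absorbed by the batch layer.
        view = {}
        for event in recent_events:
            view[event["key"]] = view.get(event["key"], 0) + event["count"]
        return view

    def query(key, batch, realtime):
        # Serving layer: merge the batch view with the real-time view at query time.
        return batch.get(key, 0) + realtime.get(key, 0)

    master = [{"key": "page/home", "count": 3}, {"key": "page/about", "count": 1}]
    recent = [{"key": "page/home", "count": 2}]
    print(query("page/home", batch_view(master), speed_view(recent)))  # -> 5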
  • 11. Lambda Architecture Flowchart
  • 12. Data Analytics System Architecture (diagram; includes online transaction processing and Facebook/Apache components)
  • 13. Fault Tolerance in the Data Stream System for High Availability
  • 14. Interaction Model in the Stream Processing System (Push/Pull)
  • 15. Map-Reduce  A programming model borrowed from functional programming  Separates details of the original problem from details of parallelism; ◦ map() produces one or more intermediate (key/value pairs) from the split input (“shards”); ◦ reduce() combines intermediate (key/value pairs) into final files after partitioning and sorting by key;  Scale to a large cluster of machines from a single machine;  Fault tolerance: Map or Reduce;  Locality: Distributed GFS chunks;  Bottleneck: The Reduce phase can’t start until the Map phase is completely finished (batch, not stream, processing): ◦ May not be suitable for real-time processing and in-depth analysis.
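  A minimal, single-process Python sketch of the model (not from the deck): the shuffle/sort between map() and reduce() is simulated in memory here, whereas a real framework partitions it across the cluster.

    from collections import defaultdict
    from itertools import chain

    def map_fn(shard_id, line):
        # map(): emit one intermediate (key, value) pair per word in an input split ("shard")
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):
        # reduce(): combine all intermediate values sharing a key, after partition/sort by key
        yield word, sum(counts)

    def mapreduce(shards, map_fn, reduce_fn):
        groups = defaultdict(list)  # the shuffle/sort phase, done by the framework on a real cluster
        for key, value in chain.from_iterable(map_fn(i, s) for i, s in enumerate(shards)):
            groups[key].append(value)
        return dict(chain.from_iterable(reduce_fn(k, v) for k, v in sorted(groups.items())))

    print(mapreduce(["the quick brown fox", "the lazy dog"], map_fn, reduce_fn))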
  • 16. Map-Reduce Pipeline
  • 17. Hadoop  HDFS: data storage and transfer, as GFS in Hadoop; ◦ NameNode (metadata), DataNode (block storage); ◦ Master Node, Slave Node; ◦ Error handling: replication (3 by default);  Job Tracker: scheduling, JobConf and JobClient;  Task Tracker: status, TaskRunner, map or reduce;  Data In/Out: ◦ HDFS block size in Input Splits ◦ # of reducers in Output;  Task Failure: report;  Job Scheduler: ◦ FIFO, Fair, Capacity,…
  • 18. Map-Reduce in Hadoop (diagram: a client submits a MapReduce job to the JobTracker on the master node; TaskTrackers on slave nodes run task instances)
  • 19. Hive & Pig  Hive: A database/warehouse on top of Hadoop; ◦ SQL as a familiar data warehousing tool ◦ Extensibility – Types, Functions, Formats, Scripts ◦ Scalability and Performance; ◦ Rich data types (structs, lists and maps); ◦ Efficient implementations of SQL filters, joins and group-bys on top of MR; ◦ Easy interactions with different programming languages;  HQL is like SQL.  Pig: A platform for easier analysis of large data sets; ◦ Pig Latin: data flow language similar to scripting languages; ◦ Pig Engine: parses, optimizes and automatically executes Pig Latin scripts; ◦ User-defined functions for column transforms (TOUPPER) or aggregation (SUM);  UDFs to take advantage of the combiner. ◦ Four join implementations built in: hash, fragment-replicate, merge, skewed; ◦ Writing load and store functions is easy once an I/O Format exists; ◦ Piggybank - a collection of user-contributed UDFs; ◦ DataFu - LinkedIn's collection of Pig UDFs.
  • 20. Apache Thrift  Software framework for scalable cross-language services, at Facebook;  A software stack + a code generation engine to build services between C++, Java, Python, PHP, Ruby, Erlang, Perl, C#, OCaml and Delphi, etc.;  Key components in this open-source stack: ◦ Type: for users to develop using completely natively defined types; ◦ Transport: used by the generated code to facilitate data transfer; ◦ Protocol: a certain messaging structure in data transport, agnostic to encoding; ◦ Versioning: staged rollouts of changes to deployed services; ◦ RPC implementation: TProcessor instance for data stream processing to realize remote procedure calls (RPC), and TServer abstraction;  The interface definition language (IDL) allows for definition of Types: ◦ A Thrift IDL file is processed by the code generator to produce code for the target languages to support the defined structs and services in the IDL file.  Similar systems ◦ SOAP. Designed for web services via HTTP, excessive XML parsing overhead; ◦ CORBA. Relatively comprehensive, debatably overdesigned and heavyweight; ◦ Avro. Dynamic typing, untagged data, no manually-assigned field IDs; ◦ COM. Embraced mainly in Windows client software. Not entirely open solution; ◦ Pillar. Lightweight and high-performance, but missing versioning & abstraction; ◦ Protocol Buffers. Closed-source, owned by Google.
  • 21. Apache Thrift
  • 22. Apache Thrift The Thrift stack is a common class hierarchy implemented in each language that abstracts out the tricky details of protocol encoding and network communication
  • 23. Chukwa  A data collection system for monitoring large distributed systems;  Provides flexible/powerful toolkit to display, monitor, and analyze results;  Architecture: ◦ Agents - run on each machine and emit data; ◦ Collectors - receive data from the agent and write it to stable storage; ◦ MapReduce jobs - parsing and archiving the data; ◦ Hadoop Infrastructure Care Center - a web-portal style interface.
  • 24. ZooKeeper in Hadoop  A shared hierarchical name space of data registers;  Exposes common services in simple interface: ◦ Naming, configuration management, locks & synchronization, group membership services, leader selection;  Each node in the namespace is called a ZNode. ◦ Persistent Nodes, Ephemeral Nodes, Sequence Nodes; ◦ Every ZNode has data and can optionally have children; ◦ Read requests are processed locally at the server; ◦ Write requests are forwarded to the leader.  ZNode paths: every ZNode exists at some path ◦ Canonical, absolute, slash-separated; ◦ No relative references; ◦ Names can have Unicode characters.  ZNode watches can be set on ZNodes ◦ One-time change triggers, always ordered.  A client connects to ZooKeeper and initiates a session;  Consistency guarantees;  Support Kerberos security.
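  A small illustration of these primitives using kazoo, a community Python client for ZooKeeper (assumed here for illustration; the ensemble address and znode paths are hypothetical):

    from kazoo.client import KazooClient  # community Python client for ZooKeeper (assumption)

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()                                   # connect and open a session

    zk.ensure_path("/app/config")                # persistent znodes along the path
    zk.create("/app/workers/worker-", b"host1",  # ephemeral + sequence node: removed when the session ends
              ephemeral=True, sequence=True, makepath=True)

    @zk.DataWatch("/app/config")                 # watch: triggered on changes to the znode's data
    def on_change(data, stat):
        print("config changed:", data, stat.version if stat else None)

    data, stat = zk.get("/app/config")           # reads are served locally by the connected server
    zk.stop()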
  • 25. ZooKeeper Service  ZooKeeper service is replicated over a set of machines;  All machines store a copy of the data (in memory);  A leader is elected on service startup;  Clients only connect to a single ZooKeeper server & maintain a TCP connection;  Clients can read from any ZooKeeper server; writes go through the leader & need majority consensus.
  • 26. Percolator  Describes how the web search index is kept up to date, at Google; ◦ Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines;  Incremental update to big data: code in Java, no need for batch process;  Provides transactions/locking, based on GFS, built on top of BigTable;  Architecture: ◦ Applications are a sequence of observers; ◦ An observer is called via notification; ◦ A notification is triggered when table data changes; ◦ Applications call BigTable’s TableServers via RPC; ◦ TableServers call GFS ChunkServer;
  • 27. Percolator  Random accesses to the document repository while maintaining data invariants;  Faster than comparable MapReduce, improved latency (100x), reducing the documents’ average age by 50%;  Time stamping and locking via the Chubby lock server;
  • 28. Caffeine  Caffeine is a new search scheme (algorithm) based on Percolator;  Even with changes, most white hat optimization tactics continue to prevail;  More competition for single, generic-type keywords, less stability of rankings, and increased focus on long-tail keywords in SEO;  Feature site titles and snippets with higher phrase/keyword density.  Faster index: returned at faster speeds  Fresher results: more current, such as blog posts from the last few days.  More emphasis on social media, like Facebook, LinkedIn, Blogger, etc.  Less emphasis on universal search. ◦ Lower on the page to make paid search more visible.  Increased prominence of video. ◦ As prominently featuring video listings.  Keywords in domain name. ◦ Do weigh keyword domain names even higher. ◦ For a new site, a microsite with your keywords embedded within the URL.  “organize the world's information, make it universally accessible and useful”.
  • 29. Panda/Farmer Update  SEO (Search Engine Optimization): ‘crawl’ the web (spider?); create this page Index and the Quality Team; the Spam Team throwing away stuff in the Index that shouldn’t be there (by the Crawl Team as well).  Google Mayday update: degrade lower-quality websites, place more weight on quality signals, lowering weight of textual relevancy signals. ◦ Anti-spam and user behavior.  Google Farmer (renamed as Panda) update: hurt “Content Farms”, or sites that contain huge amounts of content of poor quality in order to rank on as high number of keyword combinations as possible; ◦ Placing the emphasis on user experience (average time spent on the site/specific page, bounce rate, Click Through Rate etc. ) ◦ The social trend - “+1” buttons was added near each result; ◦ Personalized Search - The changes in results between users could arise from geographic differences, daytime changes; If the user is logged in to Google account the results would be adjusted even further since Google’s servers collect information about the user and his browsing habits.
  • 30. Panda/Farmer Update
  • 31. Dremel  Scalable, interactive ad-hoc query system for analysis of nested data; ◦ multi-level execution trees and SQL-like language to express ad hoc queries ◦ column-striped storage representation of nested data  BigQuery: an interactive query service, an external implementation of Dremel; ◦ Hive and Pig are slow  Data model is based on strongly-typed nested records  Tablet Storage and Horizontal Partitioning to save space  Levels are packed as a bit sequence;  Queries are scheduled based on their priorities and load-balanced, with fault tolerance ◦ Slots and histograms ◦ Handles stragglers ◦ Tablets are three-way replicated  Interoperates with Google's data management tools ◦ In situ data access (e.g., GFS, Bigtable) ◦ MapReduce pipelines
  • 32. Dremel: schema, records, repetition level and definition level (diagram)
  • 33. Apache Drill • Open source Implementation of Google BigQuery • Flexibility: broader range of query languages  Fast ◦ Low latency queries ◦ Columnar execution: like google dremel ◦ Complement native interfaces and MapReduce/Hive/Pig  Open ◦ Community driven open source project ◦ Under Apache Software Foundation  Modern ◦ Standard ANSI SQL:2003 (select/into) ◦ Nested/hierarchical data support ◦ Schema is optional  Query any HBase, Cassandra or MongoDB table ◦ Supports RDBMS, Hadoop and NoSQL  DrQL: SQL-like query language  Mongo Query Language
  • 34. YARN  Yet Another Resource Negotiator: MRv2 (Next Gen. Hadoop); ◦ Predictable Latency – A major customer concern; ◦ Support for alternate programming paradigms to MR.  Separate the tasks of Job Tracker ◦ Resource management ◦ Job Scheduling / Management  Resource Manager: Manages the global assignment of compute resources to applications; ◦ A pure scheduler (capacity/fair scheduler) and an Application Manager to accept job submissions for Application Master;  Node Manager: the per-machine framework agent for monitoring the resource usage, reporting to the Scheduler;  Application Master: manages the application’s life cycle (scheduling and coordination), a single job or a DAG of jobs.  Container: a process started by Node Manager to grant an application the privilege to use a certain amount of resources.
  • 35. YARN Architecture (diagram: a client submits MapReduce jobs to the Resource Manager; Node Managers host containers and the Application Master; MapReduce status, node status and resource requests flow between the components)
  • 36. Tez: Accelerating YARN Query Processing  Open source Apache incubator project and Apache licensed.  Distributed execution framework targeted towards data-processing applications.  Based on expressing a computation as a dataflow graph (DAG, directed acyclic graph) .  Built on top of YARN – the resource management framework for Hadoop.  Key design themes: ◦ Ability to express, model and execute data processing logic (vertex and edge) by dataflow definition API; ◦ Flexible Input-Processor-Output task model: data format, read/write, vertex task; ◦ Performance via Dynamic Graph Reconfiguration: pluggable vertex management modules; ◦ Performance via Optimal Resource Management: efficient acquisition of resources from YARN.  A Tez task is constituted of all the Inputs, Processor and all the Output(s).  Design of Tez includes support for pluggable vertex management to collect relevant info. from tasks and change the dataflow graph at runtime to optimize for performance and resource usage.  Re-using containers: not needing to allocate each container via the YARN ResourceManager (RM).  A Tez session, maps to one instance of a Tez Application Master (AM): container reuse and caching.
  • 37. Tez: Accelerating YARN Query Processing
  • 38. Storm: Distributed, Real-Time  Built by BackType, acquired by Twitter, written in Clojure;  Tuples: ordered list of elements;  Streams: Unbounded sequence of tuples  Spout: Source of Stream ◦ E.g. Read from Twitter streaming API, event data,…  Bolts: Processes input streams and produces new streams ◦ E.g. Functions, Filters, Aggregation, Joins,…  Topologies: a DAG of spouts and bolts;  Tasks: instances of Spouts and Bolts;  Stream grouping between spouts and bolts: 7 options ◦ All grouping, none grouping; ◦ Global grouping, local grouping; ◦ Shuffle grouping, direct grouping; ◦ Fields grouping.  Guaranteed message processing;  Multilang support and transactional topologies;  Applied for stream processing, continuous computation, distributed RPC;  Trident: high-level abstraction on top of Storm, like Pig for Hadoop.
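  A conceptual word-count sketch in plain Python (not the Storm API, just the spout/bolt idea): tuples flow from a spout through bolts; in a real topology the pieces run as parallel tasks and fields grouping routes each word to a fixed count-bolt task.

    import random
    from collections import defaultdict

    def sentence_spout():
        # Spout: source of an (in principle unbounded) stream of tuples; finite here for the demo.
        sentences = ["the cow jumped over the moon", "the man went to the store"]
        for _ in range(4):
            yield (random.choice(sentences),)

    def split_bolt(tup):
        # Bolt: consumes sentence tuples and emits one word tuple per word.
        for word in tup[0].split():
            yield (word,)

    counts = defaultdict(int)
    def count_bolt(tup):
        # Stateful bolt: with fields grouping the same word always reaches the same task,
        # so a per-task in-memory dict is sufficient.
        counts[tup[0]] += 1

    # "Topology": spout -> split bolt -> count bolt (a real topology is a DAG of parallel tasks).
    for sentence in sentence_spout():
        for word in split_bolt(sentence):
            count_bolt(word)
    print(dict(counts))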
  • 39. Storm: System Architecture (diagram) — UI: web-based; Nimbus: master node, like the JobTracker; Supervisor: worker node that manages workers; ZooKeeper: stores metadata; topologies are built from spouts and bolts
  • 40. Summingbird  A library for writing MR programs as Scala code and running them on distributed MR platforms, with Storm (streaming) & Scalding (batch, on top of Cascading);  Data: stream and snapshot;  Components in Summingbird: ◦ Producer: data stream abstraction for Platform to compile MR workflow ◦ Platform: implemented for any stream MR library; ◦ Source: stream of data ◦ Store: the “reduce” of streaming MR; represents a snapshot of all key-value pairs; ◦ Sink: materialize an un-aggregated “stream” representation, not a snapshot; ◦ Service: perform a “lookup join” from Store’s snapshot or Sink’s stream; ◦ Plan: final representation of the MR flow produced by a Platform.  Related projects: ◦ Algebird is an abstract algebra library for Scala; ◦ Bijection’s Injection typeclass to share serialization between different platforms and clients; ◦ Chill augments Kryo with options, and provides integration with Storm, Scala, Hadoop; ◦ Storehaus’s async key-value store traits implement Summingbird’s client; ◦ Tormenta provides a layer over Storm’s Scheme and Spout interfaces.
  • 41. S4: Simple Scalable Streaming System  Apache Incubator S4 by Yahoo, written in Java; ◦ Real-time/decentralized/scalable/event-driven/stream processing; ◦ Actors programming model (PEs); ◦ All in-memory, no disk bottlenecks; ◦ Pluggable event serving policies: load shedding, throttling, blocking; ◦ Failover, checkpointing , replication and recovery; ◦ Dynamic load balancing/adaptive load management.  Communication, scheduling & distribution across containers; ◦ S4 applications are built as a graph of:  Processing elements (PEs): event handling  Streams of events that interconnect PEs ◦ S4 processing nodes: distributed PE containers/hosts for PE; ◦ S4 clusters define named ensembles of S4 processing nodes; ◦ S4 events are dispatched to nodes according to their key. ◦ PEs communicate asynchronously by sending events on streams. ◦ Communication Layer: cluster management/failover (ZooKeeper); ◦ S4 adapters: applications to convert external streams into S4 events.  Adapters are also S4 applications, then scaled easily.
  • 42. Hierarchical Perspective of S4
  • 43. Data Freeway  A scalable data stream framework at Facebook  Scribe: Simple push/RPC-based logging system  Calligraphus: Call sync every 7 seconds ◦ RPC -> File System  Each log category is represented by 1 or more FS directories  Each directory is an ordered list of files ◦ Bucketing support  Application buckets are application-defined shards.  Infrastructure buckets allows log streams from x B/s to x GB/s  Continuous Copier: File System to File System ◦ Low latency and smooth network usage ◦ Deployment; Implemented as long-running map-only job, can move to any simple job scheduler ◦ Coordination: Use lock files on HDFS for now, move to Zookeeper soon  PTail: File System -> Stream ( -> RPC ) ◦ Checkpoints inserted into the data stream ◦ Can roll back to tail from any data checkpoints ◦ No data loss/duplicates
  • 44. Data Freeway Architecture
  • 45. Scribe  Scribe is a server for aggregating streaming log data, designed to scale to a very large number of nodes and be robust to network/node failures;  Scribe servers are arranged in a directed graph, with each server knowing only about the next server in the graph;  Scribe is unique in that clients log entries consisting of two strings, a category and a message; ◦ The category is a high level description of the intended destination of the message and can have a specific configuration in the scribe server;  The server allows for configurations based on category prefix, and a default configuration that can insert the category name in the file path;  Flexibility and extensibility is provided through the “store” abstraction; ◦ Stores are loaded dynamically based on a configuration file, and can be changed at runtime without stopping the server; ◦ Stores are implemented as a class hierarchy, and stores can contain other stores;  Scribe is implemented as a Thrift service using a non-blocking C++ server.
  • 46. Apache Flume  Flume is a distributed service for collecting, aggregating, and moving large amounts of log data; ◦ A simple/flexible architecture based on streaming data flows; ◦ Robust/Fault tolerant with tunable reliability mechanisms and failover and recovery mechanisms; ◦ Use an extensible data model that allows for online analytic application;  Data flow model: ◦ An Event is a unit of data that flows through a Flume agent; a Flume agent is a process (JVM) that hosts the components; ◦ A Flume source stores an event into one or more channels (passive stores) until it’s consumed by a Flume sink; ◦ The sink puts the event into HDFS or forwards it to the Flume source of the next Flume agent (next hop) in the flow; ◦ The source and sink within the given agent run asynchronously with the events staged in the channel.  Set up: ◦ Flume agent configuration is stored in a local file (source, sink, channel); ◦ The agent knows what individual components to load and how they are connected to constitute the flow; ◦ Build multi-hop flows where events travel through multiple agents before reaching the final destination.
  • 47. Apache Flume
  • 48. Puma: Real-Time MR  Real-time data pipeline developed at Facebook, to be open source soon ◦ Utilize existing log aggregation pipeline (Scribe-HDFS) ◦ Extend low-latency capabilities of HDFS (Sync+PTail) ◦ High-throughput writes (HBase)  Support for real time reliable aggregation: Unique user count, most frequent elements ◦ Utilize HBase atomic increments to maintain roll-ups ◦ Complex HBase schemas for unique-user calculations ◦ Store checkpoint information directly in HBase  Multiple Group-By operations per log line  The first key in Group-By is always time/date-related  Two newer versions: ◦ Puma 2: simple ◦ Puma 3: better performance  PQL – Puma Query Language  (pipeline: Log Stream → Aggregations → Storage → Serving)
  • 49. Puma2: Real-Time MR  Map phase with PTail ◦ PTail provides parallel data streams ◦ Divide the input log stream into N shards ◦ 1st version only supported random bucketing ◦ Now supports application-level bucketing  Reduce phase with HBase ◦ HBase does single increment on multiple columns ◦ Every row+column in HBase is an output key ◦ Aggregate key counts using atomic counters ◦ Also maintain per-key lists or other structures  (pipeline: PTail → Puma2 → HBase → Serving)
  • 50. Puma3: Real-Time MR  Puma3 is sharded (split) by aggregation key.  Each shard is a hash map in memory.  Each entry in the hash map is a pair of an aggregation key and a user-defined aggregation.  HBase as persistent key-value storage (pipeline: PTail → Puma3 → HBase → Serving, with write, checkpoint, read and join workflows).  Special aggregations: ◦ Unique counts calculation: adaptive sampling, Bloom filter (future) ◦ Most frequent item (future): lossy counting, probabilistic lossy counting
  • 51. Amazon Kinesis  Kinesis scales for real-time processing of streaming big data;  Kinesis requires that a user create at least two applications—a “Producer” and a Kinesis application (also called a “Worker”)— using Amazon’s Kinesis APIs;  The “Producer” takes data from some source and converts it into a "Kinesis Stream," a continuous flow of 50-kilobyte data chunks sent in the form of HTTP PUTs;  The "Worker" takes the data from the Kinesis Stream and does whatever processing is required;  The Kinesis application can run on any type of Amazon EC2 instance, and Kinesis will auto-scale the instances to handle varying streaming loads;  The Kinesis SDK libraries, used to create Kinesis Producers and applications, are only available for Java, but Kinesis applications can be written in any language by calling the Kinesis APIs directly;  Stream output is sent to Amazon’s S3, DynamoDB, or Redshift;  Kinesis can create DAGs of Kinesis applications and data streams.
  • 52. Amazon Kinesis
  • 53. NoSQL- Not Only SQL  Class of non-relational data storage systems (non-RDBS)  Usually do not require a fixed table schema nor use concept of joins  All NoSQL offerings relax one or more of the BASE/ACID properties ◦ Strong: ACID(Atomicity Consistency Isolation Durability) ◦ Weak: BASE(Basically Available Soft-state Eventual consistency )  Three major papers for NoSQL movement ◦ BigTable (Google) ◦ Dynamo (Amazon)  Gossip protocol (discovery and error detection)  Distributed key-value data store  Eventual consistency ◦ CAP Theorem: consistency, availability and partition tolerance  NoSQL solutions fall into two major areas ◦ Key/Value or ‘the big hash table’.  Amazon S3 (Dynamo) ◦ Schema-less in multiple flavors, column, document or graph-based.  Cassandra (column-based)  CouchDB (document-based)  Neo4J (graph-based)  HBase (column-based)
  • 54. BigTable at Google  A Bigtable is a sparse, distributed, persistent multi-dim. sorted map  The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.  Rows with consecutive keys are grouped together as “tablets”.  Column keys are grouped into sets called “column families”, which form the unit of access control.  A column key is named using the following syntax: family:qualifier.  Bigtable uses the distributed Google File System (GFS) to store log and data files.  The Google SSTable file format is used internally to store Bigtable data. ◦ An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings.  Bigtable relies on a persistent distributed lock service ◦ Chubby (a name space).
  • 55. Dynamo: Key-value Store  Distributed, Highly available storage system from Amazon; ◦ SLA: Application can deliver its functionality in a bounded time.  Simple interface associated with a key: get(key) and put(key, data) ◦ Binary objects (data<1MB) identified by a unique key.  Partitioning for scale incrementally ◦ consistent hashing: the output range of a hash function is treated as a “ring”. ◦ “virtual node”: each node can be responsible for more than one virtual node.  Replication for high availability and durability ◦ “preference list”: The list of nodes that is responsible for storing a particular key.  Data versioning: vector clocks to capture causality between versions; ◦ A vector clock is a list of (node, counter) pairs.  Execution of get ()/put (): client, coordinator  Handling failures: “sloppy quorum” and hinted handoff  Replica synchronization: anti-entropy protocol using Merkle (hash) tree  Membership/Failure Detection ◦ Gossip-based protocol  Implementation: Java and APIs over HTTP ◦ BDB or MySQL;  Note: Amazon S3 service powered by Dynamo.
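  A minimal Python sketch of consistent hashing with virtual nodes and a preference list, assuming MD5 as the ring hash (node names and the key below are made up):

    import hashlib
    from bisect import bisect_right

    class HashRing:
        """Consistent hashing: nodes own arcs of a hash ring; virtual nodes even out the load."""

        def __init__(self, nodes, vnodes=8):
            self.ring = sorted((self._hash(f"{node}#{i}"), node)
                               for node in nodes for i in range(vnodes))
            self.hashes = [h for h, _ in self.ring]

        @staticmethod
        def _hash(s):
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        def preference_list(self, key, n_replicas=3):
            # The first n distinct nodes encountered walking clockwise from the key's position.
            idx = bisect_right(self.hashes, self._hash(key)) % len(self.ring)
            nodes = []
            for _, node in self.ring[idx:] + self.ring[:idx]:
                if node not in nodes:
                    nodes.append(node)
                if len(nodes) == n_replicas:
                    break
            return nodes

    ring = HashRing(["A", "B", "C", "D"])
    print(ring.preference_list("user:42"))  # the nodes responsible for storing this key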
  • 56. Cassandra  A Decentralized Structured Storage System;  Design goals: ◦ High availability, eventual consistency, incremental scalability, optimistic replication, “knobs” to tune tradeoffs between consistency, durability and latency, low total cost of ownership, and minimal administration;  Architecture:  Each node communicates with each other through the Gossip protocol, which exchanges information across the cluster every second;  A commit log on each node to capture write activity; data durability assured  Data also written to an in-memory structure (memtable) and then to disk once the memory structure is full (an SStable);  It is a row-oriented, column structure;  A key space is akin to a database in the RDBMS world;  A column family is similar to an RDBMS table but is more flexible/dynamic;  A row in a column family is indexed by its key; other columns may be indexed  Cassandra ~= bigtable + dynamo
  • 57. HBase  HBase is an open-source, distributed, column-oriented database built on top of HDFS based on Google BigTable; ◦ Part of the Hadoop ecosystem (written in Java) ◦ Native connections to Map-Reduce  HBase by default manages a ZooKeeper instance as the authority on cluster state;  Structures data as tables of column-oriented rows ◦ Large, variable, number of columns per row ◦ Rows stored in sorted order ◦ Region: contiguous set of sorted rows, made of Stores ◦ Table: split roughly into equal sized regions ◦ RegionServer: keeps log of every update, manage region split ◦ Master: assigns Table Regions to RegionServers ◦ MemStore: Holds in-memory modifications to the Store.  HBase is not fully ACID-compliant  Can random read and write (no built-in joins) ◦ Single row operations (put, get, scan) ◦ Multiple row operations (scan, multiPut)
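  A hedged example of the row-oriented access pattern using happybase, a community Python client that talks to HBase through its Thrift gateway (host, table and row keys below are hypothetical):

    import happybase  # community Python client for HBase via its Thrift gateway (assumption)

    conn = happybase.Connection("hbase-thrift-host")  # hypothetical host
    table = conn.table("webtable")                    # hypothetical table

    # Single-row put/get: columns are named "family:qualifier", values are raw bytes.
    table.put(b"com.example/index.html", {b"contents:html": b"<html>...</html>"})
    print(table.row(b"com.example/index.html"))

    # Scan a contiguous range of sorted row keys.
    for key, data in table.scan(row_prefix=b"com.example"):
        print(key, data)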
  • 58. HBase Architecture
  • 59. NewSQL: The new way to handle big data  A set of various new scalable/high-performance SQL database vendors (or databases); ◦ Migrate existing applications to adapt to new trends of data growth; ◦ Develop new applications on highly scalable OLTP systems; ◦ Rely on existing knowledge of OLTP (online transaction processing) usage.  Technical characteristics: ◦ Scalable performance of NoSQL systems for OLTP read-write workloads; ◦ ACID (Atomicity, Consistency, Isolation, Durability) guarantees, support for transactions; ◦ SQL as the primary mechanism for application interaction. ◦ A non-locking concurrency control mechanism so real-time reads will not conflict with writes and thus will not stall them. ◦ An architecture providing much higher per-node performance than available from traditional RDBMS solutions. ◦ A scale-out, shared-nothing architecture, capable of running on a large number of nodes without bottlenecks.  NewSQL categorization: ◦ New databases: improving by making non-disk (memory) or new kinds of disks (flash/SSD) the primary data store. ◦ New MySQL storage engines: storage engines developed by Xeround, Akiban, MySQL NDB Cluster, GenieDB, Tokutek, etc. ◦ Transparent clustering: cluster transparently and provide transparent sharding to improve scalability.  Google Spanner is an example of a NewSQL DB.
  • 60. Spanner: Google’s Globally- Distributed Database  Scalable, multi-version, globally distributed, and synchronously-replicated database;  Scale up to millions of machines across hundreds of datacenters and trillions of database rows;  Reshard data across machines as the data or the servers changes, and automatically migrates data across machines (or datacenters) to balance load and in response to failures; ◦ A Spanner deployment is called a universe; Spanner is organized as a set of zones; ◦ A zone has one zonemaster and between one hundred and several thousand spanservers; ◦ Directory (a bucketing abstraction): a set of contiguous keys that share a common prefix; ◦ A directory is the unit of data placement, also the smallest unit with geographic replication properties;  A data model based on schematized semi-relational tables, general-purpose transactions, externally consistent reads/writes and SQL-based query language;  Replication for global availability and geographic locality, and clients failover between replicas; ◦ A tablet is similar to Bigtable’s tablet abstraction and a single Paxos state machine on top of each tablet; ◦ Each spanserver has a lock table to implement concurrency control;  Data is stored in schematized semi-relational tables, versioned, time-stamped; ◦ TrueTime API, implemented by a set of time master machines per datacenter and a timeslave daemon per machine.
  • 61. Spanner: Google’s Globally- Distributed Database Spanner server organization Spanserver software stack TrueTime architecture
  • 62. Comparison of Puma, Storm, S4
  • 63. Kafka  A distributed publish-subscribe messaging system;  Maintains feeds of messages in categories called topics: a category or feed name to which messages are published;  Each partition is an ordered, immutable sequence of messages that is continually appended to a commit log; ◦ Each partition has one server which acts as the "leader" and zero or more servers which act as "followers";  Messaging traditionally has two models: queuing and publish-subscribe; ◦ Producers publish messages to a Kafka topic; ◦ Consumers subscribe to topics and consume the published messages by pulling;  A Kafka cluster is comprised of one or more servers, called brokers, that store published messages.  Efficiency on a single partition ◦ A very simple storage: log == list of files, message addressed by a log offset; ◦ Efficient data transfer: No message caching, zero-copy transfer, FS buffering; ◦ Stateless broker: Consumer maintains its own state, SLA-based retention.  Distributed coordination: Auto load balancing ◦ Make a partition within a topic the smallest unit of parallelism; ◦ No central “master” node, but ZooKeeper helps for a consensus service;  Delivery guarantees: messages delivered in order to a consumer. ◦ Built-in replication to store each message on multiple brokers.
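  A minimal producer/consumer example using the kafka-python client (the broker address and topic name are assumptions):

    from kafka import KafkaProducer, KafkaConsumer  # kafka-python client (assumption)

    # Producer: publishes messages to a topic; the broker appends them to a partition's log.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("page-views", b"user42 viewed /home")  # hypothetical topic and payload
    producer.flush()

    # Consumer: pulls messages from the topic and tracks its own offset.
    consumer = KafkaConsumer("page-views",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for msg in consumer:
        print(msg.partition, msg.offset, msg.value)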
  • 64. Kafka Cluster
  • 65. Samza  Samza is a stream processing framework on the top of Hadoop (MRv2.0) ◦ Simple API: a very simple call-back based "process message" API; ◦ Managed state: snapshotting and restoration; ◦ Fault tolerance: work with YARN to migrate your tasks; ◦ Durability: uses Kafka to guarantee no messages get lost; ◦ Scalability: partitioned and distributed at every level; ◦ Pluggable: provides a pluggable API to run Samza with other messaging systems; ◦ Processor isolation: works with YARN, to give security and resource scheduling.  Concepts in Samza: ◦ Streams: immutable messages of a similar type or category; ◦ Jobs: code that performs a logical transformation on a set of input streams ; ◦ Partitions: each partition in the stream is a totally ordered sequence of messages; ◦ Tasks: the unit of parallelism of the job; ◦ Dataflow graphs: nodes - streams containing data, edges - jobs performing transformations ◦ Containers: unit of physical parallelism to runs one or more tasks.  Samza architecture: a stream processing built with Kafka and YARN ◦ Streaming: Kafka ◦ Execution: YARN ◦ Processing: Samza API ◦ Uses YARN and Kafka to provide stage-wise stream processing /partitioning.
  • 66. Samza Concepts and Architecture
  • 67. Machine Learning  “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” ◦ Supervised model: labeled data; ◦ Unsupervised model: unlabeled data; ◦ Semi-supervised model: both labeled and unlabeled data; ◦ Reinforcement Learning: learn by interacting with an environment.  Types of ML algorithms ◦ Prediction: predicting a variable from data ◦ Classification: assigning records to predefined groups ◦ Clustering: splitting records into groups based on similarity ◦ Association learning: seeing what often appears together with what  Relationship with others ◦ Artificial intelligence: emulate how the brain works with programming;  ML is a branch of AI ◦ Data mining: building models in order to detect the patterns; ◦ Statistical analysis: probabilistic models, on which to infer using data; ◦ Information retrieval: retrieval of information from a collection of data (doc).
  • 68. Some Issues in ML  Training/testing data (70%/30%)  Data unbalanced (one class’ data more than others) ◦ Sampling, learning algorithm modification (cost-sensitive), ensemble,…  “Open set” (how to handle unknown or unfamiliar classes);  Feature extraction ◦ Sparse coding, vector quantization,…  Curse of Dimensionality: Sensitivity to “noise” ◦ Dimension reduction, manifold learning/distance metric learning  Linear or non-linear model ◦ Local/Global minimum (convex/concave objective function): Learning rate ◦ Regularization: L-1/L-2 norm ◦ Kernel trick: mapping nonlinear feature space to high dim. linear space  Discriminative or generative model ◦ Bottom up (conditional distribution) /Top down (joint distribution)  Over-fitting: Learn the “noise” ◦ Cross validation with grid search  Vanishing gradient and sensitivity of initialization  Performance evaluation ◦ Precision/Recall, confusion matrix, ROC (receiver operating characteristic)
  • 69. “Data Unbalancing” Issue in ML  Resampling methods for balancing the data set.  Over-sampling, under-sampling, importance sampling;  Modification of existing learning algorithms.  Cost-sensitive learning;  One class classification;  Classifier ensemble (bagging, boosting, random forest…)  Measuring the classifier performance in imbalanced domains.  ROC, F-measure,…  Relationship between class imbalance and other data complexity characteristics.
  • 70. “Open Set” Issue in ML  How to handle unknown or unfamiliar classes?  Label as one of known classes or as unknown;  Zero shot learning/unseen class detection;  Novelty detection with null space methods;  One class SVM;  Multiple classes:  Artificial super class from all given classes;  Combine several one class classifiers learned separately;  K-nearest neighbors;
  • 71. “Curse of Dimensionality” in ML  Curse of dimensionality: distributing bins or basis functions uniformly in the input space may work in 1 dimension, but will become exponentially useless in higher dimensions;  Learning a "state-of-nature" from a number of samples in a high-dimensional feature space with each feature having a number of possible values, an enormous amount of training data is required to ensure that there are several samples with each combination of values;  With a fixed number of training samples, the predictive power reduces as the dimensionality increases, and this is known as the Hughes effect or Hughes phenomenon;  How to avoid it?  Dimension reduction: PCA, LDA, MDS;
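  For example, a quick PCA-based dimension reduction with scikit-learn (random data, purely illustrative):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.RandomState(0).normal(size=(200, 50))  # 200 samples in a 50-dimensional space
    pca = PCA(n_components=10)                           # project onto the 10 leading components
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape, pca.explained_variance_ratio_.sum())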
  • 72. “Over-fitting” Issue in ML  A statistical model describes ”noise” instead of the underlying relationship;  Over-fitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations;  A model which has been over-fit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data;  How to avoid over-fitting?  Explicitly penalize overly complex models;  Test the model's ability to generalize by evaluating its performance on a set of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter;  Methods: cross-validation, regularization, early stopping, pruning, Bayesian priors on parameters or model comparison;
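  For instance, cross-validation with a grid search over the regularization strength, sketched with scikit-learn on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    # Smaller C = stronger L2 regularization; 5-fold CV estimates generalization, not training fit.
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)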
  • 73. Large Scale Machine Learning  Data independent model ◦ Assumes each data instance can be independently computed, such as Hadoop  Locally dependent data model ◦ Assumes many data vertices locally connected with their neighbor vertices; each data vertex updates its own status in parallel according to the status of its connected neighbor vertices, like GraphLab.  Deep learning: Learn multiple layers of data representations (features) ◦ Unsupervised pre-training + supervised fine tuning ◦ MLP, CNN, DBN, DBM, SDAE,…;  Online ML: fast and memory‐efficient ◦ Stochastic/incremental gradient descent (SGD)  Ensemble learning: easy to distribute, scalable ◦ Boosting, bagging, stacking, random forest,…  Open source: Mahout, R, WEKA, MLPack, MLBase,…  ML on parallel machines ◦ GPU, cloud or cluster (distributed), multi-core,…
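  A small scikit-learn sketch of online learning with SGD via partial_fit over a simulated mini-batch stream (the data is synthetic):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    clf = SGDClassifier()            # linear classifier trained with stochastic gradient descent
    classes = np.array([0, 1])       # must be declared up front for incremental learning

    # Simulate an unbounded stream as mini-batches; each batch updates the model in place.
    for _ in range(200):
        X_batch = rng.normal(size=(32, 10))
        y_batch = (X_batch[:, 0] + 0.1 * rng.normal(size=32) > 0).astype(int)
        clf.partial_fit(X_batch, y_batch, classes=classes)

    X_test = rng.normal(size=(1000, 10))
    print(clf.score(X_test, (X_test[:, 0] > 0).astype(int)))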
  • 74. Trade-off in Large Scale ML  Small scale vs Large scale ◦ We have a small-scale learning problem when the active budget constraint is the number of examples 𝑛. ◦ We have a large-scale learning problem when the active budget constraint is the computing time 𝑇.  Statistical Perspective ◦ It is good to optimize an objective function that ensures a fast estimation rate when the number of examples increases.  Optimization Perspective ◦ To efficiently solve large problems, it is preferable to choose an optimization algorithm with strong convergence properties.  Incorrect Conclusion ◦ To address large-scale learning problems, use the best algorithm to optimize an objective function with fast estimation rates  Learning with approximate optimization ◦ Stochastic gradient descent (historically associated with BP)
  • 75. Some Issues in Large Scale ML  Job scheduling ◦ Schedule and monitor “batch” jobs;  Parallel execution ◦ Distributed ◦ SIMD;  Auto Scaling ◦ Scale up (vertical) ◦ Scale out (horizontal)  Monitoring  Fault-tolerant ◦ Failover ◦ recover  Load balancing ◦ Distribute work load across the cluster  Work flow management ◦ Choreography ◦ Orchestration
  • 76. Job Scheduling  A job scheduler is a program that enables an enterprise to schedule and, in some cases, monitor computer "batch" jobs (units of work, such as the running of a payroll program).  A job scheduler can initiate and manage jobs automatically by processing prepared job control language statements or through equivalent interaction with a human operator.  Functions:  Avoid starvation;  Maximize throughput;  Minimize response time;  Optimal use of resources.  Hadoop: FIFO, FAIR (Facebook), Capacity, Dynamic Priority Schedulers.
  • 77. Auto-Scaling  Auto-scaling: scales up/down when the load increases/ decreases, ability to handle increasing amount of work gracefully;  Vertical scalability:  Scaling Up: maintain performance levels as concurrent request increases;  Horizontal scalability:  Scaling Out: meet demand through replication, across a pool of servers;  Dimensions  Load. Handling increasing load by adding resources;  Geographic. Maintain perf. in case geographically distributed systems;  Functional. Adding new features using minimum effort.  Amazon’s Cloud Watch: EC2 (CPU, Disk/Network I/O), ELB.
  • 78. Load Balancing  Load balancing distributes workload across one or more servers, network interfaces, hard drives, or other computing resources;  A load balancer provides the means by which instances of applications can be provisioned and de-provisioned automatically, without requiring changes to the network or its configuration;  Determine the maximum connection rate that the various solutions are capable of supporting;  Failover: continuation of the service after the failure;  Amazon Elastic Load Balancer (ELB): it facilitates distributing incoming traffic among multiple AWS instances (like HAProxy);  Span Availability Zones (AZ), and can distribute traffic to different AZs;
  • 79. Workflow Management  Workflow is loosely-coupled parallel application, consists of a set of computational tasks linked via data/control-flow dependencies;  how tasks are structured, who performs them, what their relative order is, how they are synchronized, how information flows to support the tasks and how tasks are being tracked.  An activity is a discrete step in a business process (workflow); Activities are orchestrated together in a workflow;  “Service choreography” – description of coordination between two/more parties .  “Service orchestration” – business process is modeled using workflows.  Amazon Simple Workflow (SWF): task coordination and state management for cloud apps;  Twitter Azkaban: A workflow scheduler that allows the independent pieces to be declaratively assembled into a single workflow.
  • 80. Apache Spark - 1  Spark: an open source cluster computing system to make data analytics fast, both fast to run and fast to write (100x faster than Hadoop MR), developed at UC Berkeley;  Spark application: A driver program runs the main function and executes various parallel operations on a cluster.  Resilient distributed dataset (RDD): a collection of elements partitioned across the nodes of a cluster and operated on in parallel; ◦ RDDs automatically recover from node failures. ◦ Two types of RDD:  Parallelized collections, take an existing Scala collection and run functions on it in parallel;  Hadoop datasets, run functions on each record of a file in HDFS or any storage system supported by Hadoop ◦ Two types of operations:  Transformations, create a new dataset from an existing one,  “Distributed reduce” transformations operate on RDDs of key-value pairs  Actions, return a value to the driver program after computation on the dataset.  Shared Variables: used in parallel operations ◦ broadcast variables, cache a value in memory on all nodes ◦ accumulators, only “added” to, such as counters and sums
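  A minimal PySpark word-count sketch showing transformations, an action and caching (assumes a local Spark installation; the input path is hypothetical):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")        # assumes a local Spark installation
    lines = sc.textFile("hdfs:///data/input.txt")     # hypothetical input path
    counts = (lines.flatMap(lambda line: line.split())          # transformation
                   .map(lambda word: (word, 1))                 # transformation
                   .reduceByKey(lambda a, b: a + b))            # "distributed reduce" on key-value RDD
    counts.cache()                                    # persist the RDD in memory across actions
    print(counts.take(10))                            # action: brings results back to the driver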
  • 81. Apache Spark - 2  Spark was originally written in Scala, with Java and Python APIs added;  Spark Streaming: Large scale stream processing framework.  Spark persists (caches) a dataset in memory across operations  Spark can run on Amazon Elastic MapReduce;  MLlib: Implementation of some machine learning functionality, as well as associated tests and data generators; ◦ Binary classification: SVM, Naïve Bayes and Logistic Regression; ◦ Linear regression: SGD; ◦ Clustering: k-means++, SVD; ◦ Collaborative filtering for recommender system: Alternating Least Squares ◦ Gradient Descent and Stochastic GD  Bagel: Implementation of Google’s Pregel; ◦ GraphX: unified graph analytics, to replace Bagel;  Shark: Port of Apache Hive to run on Spark.
  • 82. Spark Streaming  Spark Streaming: Run a streaming computation as a series of very small, deterministic batch jobs; ◦ Scales to hundreds of nodes ◦ Achieves low latency ◦ Efficiently recover from failures ◦ Integrates with batch and interactive processing  Programming model: Dstream (Discretized Stream) ◦ Represents a stream of data ◦ Implemented as a sequence of RDDs  Languages: Scala API, Java API, Python API  DStream + RDDs = POWER; ◦ Combine live data streams with historical data; ◦ Combine streaming with MLlib, GraphX algos; ◦ Query streaming data using SQL;  FT: replicated in memory, recomputed from replicated data.
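  A minimal Spark Streaming sketch in Python: 1-second micro-batches from a socket source (host/port are hypothetical), word-counted per batch:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "streaming-wordcount")
    ssc = StreamingContext(sc, batchDuration=1)        # a DStream = a sequence of 1-second RDD micro-batches
    lines = ssc.socketTextStream("localhost", 9999)    # hypothetical socket source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                                    # print each micro-batch's counts
    ssc.start()
    ssc.awaitTermination()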
  • 83. Spark Runtime and Spark Streaming
  • 84. Petuum: Iterative-Convergent Distributed ML  Petuum comes from "perpetuum mobile“, a musical style (continuous steady stream of notes);  general-purpose distributed platform for Big Machine Learning;  execute iterative updates in a manner that most quickly minimizes the loss function in a large-scale distributed environment;  Statistical: error-bounded consistency schemes to decrease network communication, rescheduling of updates to decrease correlation effects and optimizing load-balancing;  A parameter server (Petuum-PS) for global parameter synchronization; ◦ a distributed key-value store that provides client machines shared-memory access to global parameters sharded on the server machines; ◦ a table-based user interface: global tables accessed globally by a row-column ID pair; ◦ Bounded Consistency: Stale Synchronous Parallel (SSP) Consistency and Value-Bounded Consistency; ◦ Process-Level and Thread-Level Caching: caching a frequently accessed row in a table, i.e. Least-Recently Used (LRU), Two-List LRU, and Priority LRU; ◦ Out-of-Core Storage Support: efficiently streaming data from hard disks or SSDs;
  • 85. Petuum: Iterative-Convergent Distributed ML  A structure-aware dynamic scheduler (STRADS) organize/distribute worker tasks; ◦ Dynamic Scheduling and Adaptive Load Balancing.  ML library includes (will be steadily enriched): ◦ Convolutional Neural Network (CNN) ◦ Distance Metric Learning ◦ Multiclass Logistic Regression ◦ Nonnegative Matrix Factorization ◦ Sparse Coding ◦ K-means ◦ MedLDA advanced topic model ◦ LDA topic model ◦ Matrix Factorization (collaborative filtering) ◦ Fully-connected Deep Neural Networks  Run bigger models on less hardware, very economical!
  • 86. Petuum: Iterative-Convergent Distributed ML Petuum parameter server topology. Servers and clients interact via a bipartite topology. name-node machine handles bookkeeping and assignment of keys to servers. STRADS architecture. The worker machines can be Petuum-PS clients or nodes without PS support. SSPTable Programming
  • 87. MillWheel: FT Stream Processing  Framework for building low-latency data-processing applications;  Part of Google Cloud Dataflow that also has Google Pub-Sub and FlumeJava;  Similar functionality to Apache Storm;  Specifying a directed computation graph and application code for individual nodes, and it manages persistent state and the continuous flow of records, with fault-tolerance; ◦ Called “transformations computations”; ◦ Computation code includes contacting external systems, manipulating other MillWheel primitives, or outputting data, once getting the input data; ◦ I/O are represented by (key, value, timestamp) triples; ◦ Keys are the primary abstraction for aggregation and comparison between different records; ◦ Streams are the delivery mechanism between different computations; ◦ Persistent state is an opaque byte string that is managed on a per-key basis; ◦ The low watermark for a computation provides a bound on the timestamps of future records arriving at that computation; ◦ Timers are per-key programmatic hooks that trigger at a specific wall time or low watermark value. ◦ FT: delivery guarantees, state manipulation.
  • 88. Mahout - Scalable ML • Apache Software Foundation Java library; • Scalable “machine learning“ and “data mining“ library that runs on Apache Hadoop mostly using the map/reduce (M/R) paradigm; • Currently Mahout supports 3”C”s+Extras use cases: • Collaborative Filtering for recommendation: • Non-distributed: Taste • Distributed: user-based or item-based on Hadoop • Collaborative filter with matrix factorization • Classification: Perceptron, RBM, Winnow, Logistic regression, Naïve Bayes, Complementary NB, HMM, boosting, random forest, SVM, … • Clustering: Canopy, k-means, mean shift, Dirichlet process, spectral, min-hash, hierarchical, LDA (Latent Dirichlet Allocation), EM,… • Frequent Pattern Mining: Parallel FP-Growth (PFP); • Locally Weighted Linear Regression; • SVD (singular value decomposition), PCA, ICA,... • Evolutionary algorithm: genetic algorithm, .... • Mahout can produce vector representations from a Lucene (Solr) index; • Run Mahout on Amazon EC2 and Amazon EMR.
  • 89. Mahout ML Algorithms (stack diagram: applications and examples on top of recommenders, clustering, classification, frequent pattern mining and genetic algorithms; math utilities for vectors/matrices/SVD, a Lucene vectorizer and primitive collections; all running on Apache Hadoop)
  • 90. Jubatus: real-time ML  Real-time and highly-scalable ML platform from NTT (Japan)  Online learning algorithms in Jubatus ◦ Linear classification  Perceptron/Passive Aggressive/Confidence Weighted Learning/Soft CWL/AROW/Normal HERD ◦ Regression: PA-based regression ◦ Nearest neighbor: LSH/Min-Hash/Euclid LSH ◦ Recommendation: Based on nearest neighbor ◦ Anomaly detection: LOF based on nearest neighbor ◦ Graph analysis: Shortest path/Centrality (PageRank) ◦ Simple statistics  Why Jubatus? ◦ Online learning requires frequent model updates ◦ Naïve distributed architecture leads to too many synchronization operations  Solution in Jubatus: Loose model sharing  Basic operations in Jubatus: operate only locally to realize real-time processing ◦ UPDATE: local model is updated by each input sample, never shared! ◦ ANALYZE: each sample randomly goes to a server and then the result goes back to the client; ◦ MIX: send out model difference which is merged and distributed.  Everything in memory (process the data on the fly)
  • 91. Jubatus positioning (diagram: batch/stored vs. real-time/online big data, and simple analysis (statistics) vs. in-depth analysis (classification, estimation, prediction))
  • 92. GraphLab in C++  A graph-based, high performance, distributed computation framework from Select Lab, CMU;  A unified multicore and distributed API;  Scalable: data and computation;  Access data directly from HDFS;  Data graph is graph with data associated with every vertex/edge;  Update Functions are operations applied on a vertex and transforming data in the scope of the vertex  Scheduler determines order of update Function evaluations ◦ Static: Synchronous or round robin schedule; ◦ Dynamic: Update functions insert new tasks into the schedule.  Shared Data table: global constant parameters can be stored;  Sync operation, similar to MR’s “reduce”: accumulate and apply;  Ensures race-free operation;  Guarantees sequential consistency.
  • 93. GraphLab Model (data graph, shared data table, scheduling, update functions and scopes)
  • 94. BSP Model  Bulk Synchronous Parallel model (Leslie Valiant, 1990);  Advantages over MapReduce and MPI: ◦ Supports message passing paradigm style of application development ◦ Provides a flexible, simple, and easy-to-use small API ◦ Performs better than MPI for communication-intensive applications ◦ Guarantees impossibility of deadlocks or collisions in the communication mechanisms  A BSP computation proceeds in a series of global supersteps.  A superstep consists of three components: ◦ Concurrent computation: each process computes using only values stored in its local memory; ◦ Communication: processes exchange data (messages) that become visible at the next superstep; ◦ Barrier synchronization: every process waits until all processes have finished computing and communicating before the next superstep begins.
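  A tiny sequential Python simulation of BSP supersteps (message delivery is deferred to the barrier; a real runtime executes the per-process computation in parallel):

    def bsp(num_procs, init, compute, supersteps):
        state = [init(p) for p in range(num_procs)]
        inbox = [[] for _ in range(num_procs)]
        for _ in range(supersteps):
            outbox = [[] for _ in range(num_procs)]
            for p in range(num_procs):                       # concurrent computation (simulated serially)
                state[p], messages = compute(p, state[p], inbox[p])
                for dest, msg in messages:                   # communication
                    outbox[dest].append(msg)
            inbox = outbox                                   # barrier: messages visible next superstep
        return state

    def compute(p, value, msgs):
        # Each process adds the messages it received and forwards its running total to the next process.
        value += sum(msgs)
        return value, [((p + 1) % 4, value)]

    print(bsp(4, init=lambda p: p + 1, compute=compute, supersteps=3))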
  • 95. BSP Model superstep
  • 96. Pregel for Graph Computing  It is a master/slave model for large-scale graph processing;  BSP model-based;  Vertex-centric model: for each vertex ◦ An arbitrary “value” that can be get/set. ◦ List of messages sent to it ◦ List of outgoing edges (edges have a value too) ◦ A binary state (active/inactive)  Combiners: ◦ Sometimes vertices only care about a summary value for the messages; ◦ Combiners allow for this (examples: min, max, sum, avg) ◦ Messages combined locally and remotely  Aggregators: ◦ Compute aggregate statistics from vertex-reported values ◦ During a superstep, each worker aggregates values from its vertices ◦ At the end of a superstep, partially aggregated values in a tree structure  Fault tolerance: save at the checkpoint and recover if necessary  Confined recovery: ◦ Failed worker “catch-up” to the rest, and other workers resend messages to it (under development?)
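  A vertex-centric Python sketch of the classic maximum-value propagation example (the graph is made up): a max combiner collapses incoming messages, vertices without messages stay halted, and message delivery is deferred to the next superstep.

    graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}  # hypothetical directed graph
    value = {"A": 3, "B": 6, "C": 2, "D": 1}
    messages = {v: [value[v]] for v in graph}   # superstep 0: seed each vertex with its own value
    superstep = 0

    while any(messages.values()):
        new_messages = {v: [] for v in graph}
        for v, inbox in messages.items():
            if not inbox:
                continue                                    # inactive vertex (voted to halt)
            best = max(inbox)                               # combiner: only the max message matters
            if best > value[v] or superstep == 0:
                value[v] = max(value[v], best)
                for neighbour in graph[v]:                  # send along outgoing edges
                    new_messages[neighbour].append(value[v])
        messages = new_messages                             # barrier: delivered next superstep
        superstep += 1

    print(value)  # every vertex converges to the global maximum, 6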
  • 97. Apache Giraph  Developed at Yahoo! and used by Facebook (open source of Pregel);  Reuse Hadoop as Map-Reduce job;  focuses on graph-based bulk synchronous parallel (BSP) computing;  ZooKeeper: responsible for computation state; ◦ partition/worker mapping ◦ checkpoint paths, aggregator values, statistics  Master: responsible for coordination ◦ assigns partitions to workers ◦ coordinates synchronization ◦ requests checkpoints ◦ aggregates aggregator values ◦ collects health statuses  Worker: responsible for vertices ◦ invokes active vertices compute() function ◦ sends, receives and assigns messages ◦ computes local aggregation values
  • 98. Apache Hama  General-purpose BSP computing, not only for graph processing; ◦ Job management / monitoring ◦ Checkpoint recovery ◦ Pluggable message transfer architecture  Written in Java;  Local & (Pseudo) Distributed run modes  MapReduce-like I/O API  Supports running in the cloud using Apache Whirr;  Supports running on Hadoop 2.0 (YARN);  Graph API  Applications besides graph processing: machine learning:  Collaborative filtering  Clustering: k-means  Gradient descent for training in classification
  • 99. Apache Hama
  • 100. Vowpal Wabbit @ Yahoo/MS  Scalable, fast, efficient linear ML engine written in C/C++  Hadoop compatible AllReduce (not MPI style) ◦ “Map” job moves program to data; ◦ Read (and cache) all data, before initializing AllReduce; ◦ Use map-only Hadoop for process control and error recovery. ◦ Use AllReduce code to sync state; ◦ Always save input examples in a cache file to speed later passes; ◦ Use hashing trick to reduce input complexity.  Algorithms in VW 7.0 ◦ Binary classification and regression ◦ Multiclass classification, NN, active learning  Reduction: One Against All or Cost-Sensitive OaA or Sequence Prediction ◦ Latent Dirichlet Allocation, and matrix factorization ◦ Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS); ◦ Stochastic Gradient Descent (SGD) and CG;  Online learning/Active learning: no need to load all data into memory ◦ Dimension correction  Feature caching (adaptive learning)  Feature hashing (importance update)
  • 101. AllReduce in VW  Each node receives a subset of data;  Every node begins with a number (vector) and ends up with the sum;  Ideal to sum local gradients or weights;  Creates a spanning tree over the nodes;  At each iteration, nodes compute the gradient over local data, and AllReduce computes the gradient over the entire data.
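  A minimal Python sketch of a tree-based AllReduce over local gradient vectors (sequential simulation only; VW's implementation runs this over Hadoop nodes):

    import numpy as np

    def allreduce(local_vectors):
        # Reduce phase: walk the binary tree bottom-up so each parent accumulates its children's sums.
        partial = [v.copy() for v in local_vectors]
        for node in reversed(range(1, len(partial))):
            parent = (node - 1) // 2
            partial[parent] += partial[node]
        # Broadcast phase: the root's total is pushed back down, so every node ends with the sum.
        total = partial[0]
        return [total.copy() for _ in range(len(partial))]

    local_gradients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
    print(allreduce(local_gradients))  # every node receives [9., 12.]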
  • 102. Trident-ML  A real-time online ML library using scalable online algorithms;  Built on top of Storm, running on a cluster of machines and supporting horizontal scaling;  Based on Trident, a high-level abstraction of Storm;  Trident-ML currently supports: ◦ Linear classification (Perceptron, Passive-Aggressive, Winnow, AROW, KLD for text) ◦ Linear regression (Perceptron, Passive-Aggressive) ◦ Clustering (KMeans) ◦ Feature scaling (standardization, normalization) ◦ Text feature extraction ◦ Stream statistics (mean, variance) ◦ Pre-trained Twitter sentiment classifier  Trident-ML is hosted on Clojars (a Maven repository).  Trident-ML processes unbounded streams of data, represented as an infinite collection of Instance or TextInstance objects.
  • 103. Storm-Pattern  Based on Cascading’s sub-project, Pattern, which uses flows as containers for ML models; ◦ Cascading is a de facto Java application framework that enables easy, large-scale development of rich data analytics and data management applications on Apache Hadoop through its API.  Importing PMML (Predictive Model Markup Language) model descriptions from R, SAS, Weka, RapidMiner, KNIME, SQL Server, etc. ◦ PMML, an XML-based file format developed by the Data Mining Group;  Working in tandem with the Lingual JDBC driver, modeling tools can pull data directly off Hadoop to train or test model quality; ◦ Lingual executes SQL queries as Cascading applications on Hadoop clusters.  Current support for PMML includes: ◦ Random Forest in PMML 4.0+ exported from R/Rattle; ◦ Linear Regression in PMML 1.1+; ◦ Hierarchical Clustering and K-Means Clustering in PMML 2.0+; ◦ Logistic Regression in PMML 4.0.1+.
  • 104. Samoa: Big Data Mining  Scalable Advanced Massive Online Analysis; ◦ A platform of stream data mining on S4/Storm from Yahoo, to be open source soon;  Short term algorithms: ◦ Hoeffding tree, K-means, Gradient boosted decision tree;  Long term algorithms: ◦ Integrate with add-ons packages (like R); ◦ Implement most common ML methods (like Mahout).  Algorithmic implement: Horizontal/Vertical data parallelism;  Platform design: ML Developer API;  Deployment and runtime.
  • 105. Samoa Runtime
  • 106. Reference  N. Marz and J. Warren. Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Manning Publications Co., Shelter Island, NY, 2013.  J. Dean and S. Ghemawat. “MapReduce: Simplified data processing on large clusters”. Communications of the ACM, 51(1), 2008.  F. Chang, et al. "Bigtable: A Distributed Storage System for Structured Data". 7th Symposium on Operating System Design & Implementation, 2006.  G. DeCandia et al. “Dynamo: Amazon’s highly available key-value store”. 21st ACM SIGOPS Symposium on Operating Systems Principles, pages 205-220. ACM, 2007.  J. Lin and D. Ryaboy. “Scaling big data mining infrastructure: The Twitter experience”. SIGKDD Explorations, 14(2), 2012.  R. Bekkerman, M. Bilenko and J. Langford, “Scaling Up Machine Learning”, Tutorial, KDD 2011.  J. Gray and D. Borthakur, “Real-time Apache Hadoop at Facebook”, SIGMOD, 2011.  L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. “S4: Distributed Stream Computing Platform”. ICDM Workshops, 2010.  S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action, Manning Publications Co., Shelter Island, NY, 2012.  Hadoop, http://hadoop.apache.org  Storm, http://storm-project.net  S4, http://incubator.apache.org/s4/  Zookeeper, http://zookeeper.apache.org/  HBase, http://hbase.apache.org/  Spark, http://spark.incubator.apache.org/  Mahout, http://mahout.apache.org/  Jubatus, http://jubat.us/en/overview.html  Graphlab, http://graphlab.org/home/  Vowpal Wabbit, http://hunch.net/~vw/  Samza, http://samza.incubator.apache.org/  Samoa, http://samoa-project.net/
  • 107. Thanks!