Scalable, Distributed, Deep Machine Learning for Big Data


Big Data, Parallel Computing, Cloud computing, Lambda architecture, MapReduce, GFS, Hadoop, HDFS, Percolator, Caffeine, Pregel, Drill, Chukwa, Hive, Pig, Scribe, Flume, Thrift, YARN, Storm, Summingbird, S4, ZooKeeper, Data Freeway, Puma1/2/3, NoSQL, BigTable, Dynamo, Cassandra, HBase, Kafka, Samza, Large Scale Machine Learning, overfitting, curse of dimensionality, load balancing, auto scaling, job scheduling, work flow, Spark, Flink, Mahout, Jubatus, GraphLab, BSP, Dremel, Giraph, Hama, Vowpal Wabbit, Trident ML, Storm-Pattern, Samoa, MXNET, FireCaffe, Spark on Caffe, SparkNet, DMTK, CNTK at MSR.

Published in: Technology, Education

  1. 1. Scalable, Distributed, Deep Machine Learning for Big Data Yu Huang Sunnyvale, California
  2. 2. Outline  Big Data - Volume, Variety, Velocity  Parallel Computing and Cloud computing  Lambda architecture: Batch, Speed and Serving Layers  Batch processing  MR: a programming model from functional programming;  Hadoop: MR implementation from Yahoo  YARN (MRv2 or next gen Hadoop)  Hive: data warehouse  Pig: high level data-flow language  ZooKeeper: high-performance coordination service  Chukwa: data collection system  Percolator: web search index at Google  Caffeine: search based on Percolator  Farmer/Panda: Google SEO (Search Engine Optimization)  Tez: Accelerating YARN Query Processing  Cascading: A data processing API and processing query planner  Scalding: An extension to Cascading at Twitter
  3. 3. Outline  Stream processing ◦ Apache Thrift: scalable cross-language services from Facebook ◦ Apache Flume: stream data collection ◦ Storm: Stream processing from Twitter ◦ Summingbird: a lib to write MR programs on Storm and Scalding at Twitter ◦ S4: Stream processing from Yahoo ◦ Scribe: server for stream data aggregating at Facebook ◦ Data Freeway: data stream at Facebook ◦ Puma: Stream processing from Facebook ◦ Kafka: distributed messaging system at LinkedIn, then Apache ◦ Samza: stream processing from LinkedIn ◦ Kinesis: real-time stream processing at Amazon ◦ Dremel: Scalable, interactive ad-hoc query system at Google ◦ Apache Drill: Implementation of Google BigQuery ◦ MillWheel: FT stream processing at Google ◦ Apache Flink - Distributed stream and batch data processing
  4. 4. Outline  NoSQL-Not Only SQL database ◦ Google Bigtable ◦ Amazon Dynamo ◦ Cassandra by Facebook ◦ HBase: like Bigtable  New SQL: ◦ Google Spanner  Graph-based  Spark – Lightning-Fast Cluster Computing at UC Berkeley  GraphLab – Big Machine Learning on Graphs at CMU  BSP (Bulk Synchronous Parallel) Model  Google Pregel - BSP based graph computing  Apache Giraph - open source for Pregel  Apache Hama - BSP based ML  Machine learning and some issues  Deep learning: Big model and big data
  5. 5. Outline ◦ Large scale machine learning and trade-off ◦ Large Scale Machine Learning ◦ Mahout - Scalable ML on Hadoop ◦ Jubatus – Distributed Online Real-time ML ◦ Vowpal Wabbit – Fast Learning at Yahoo/MS ◦ Trident ML and Storm Pattern: ML on Storm, YARN ◦ Samoa: ML on S4, Storm ◦ DMTK: Distributed Machine Learning Toolkit @ MSR  Issues in Scalable Distributed ML ◦ Load balancing ◦ Auto scaling ◦ Job Scheduling ◦ Workflow management ◦ Fault tolerance  Data and Model Parallelism  Parameter Server Framework  Peer-to-Peer Framework
  6. 6. Outline  Distributed Deep Learning ◦ YahooLDA: Scalable parallel framework in latent variable models ◦ DistBelief – Distributed deep learning on cluster ◦ H2O – Distributed deep learning on Spark ◦ Adam at MSR – distributed deep learning ◦ DL4J – open source for deep learning on Hadoop and Spark ◦ Petuum – distributed machine learning ◦ SINGA – distributed deep learning ◦ TensorFlow: Google large scale distributed DL ◦ MXNET: heterogeneous distributed deep learning ◦ Caffe on Spark @Yahoo and SparkNet with Caffe @ UC Berkeley ◦ CNTK @MSR ◦ Elephas: Keras & Spark ◦ PaddlePaddle @Baidu  Distributed learning and optimization ◦ Proximal splitting/Auxiliary coordinates; ◦ Bundle (sub-gradient); ◦ Shotgun: parallelized CDM (coordinate descent method) ◦ Asynchronous SGD; ◦ Hogwild/Dogwild.
  7. 7. Big Data  Volume (large amounts of data gathered);  Variety (various degrees of structure);  Velocity (how data flow, at high rates);  Value (business);  Variability (changes);  Veracity (quality).  Data as a Service (DaaS) in the cloud;  Two main strategies for dealing with big data: ◦ Sampling; ◦ Distributed systems.  Big Data Challenges ◦ Protecting privacy; ◦ Integration of big data technologies into enterprise landscape; ◦ Addressing increasing real time needs with increasing data volume and varieties; ◦ Leveraging cloud computing with big data storage and processing.
  8. 8. Big Data Instances  One billion data instances ◦ Web-scale ◦ Guaranteed to contain data in different formats  ASCII text, pictures, javascript code, PDF documents… ◦ Guaranteed to contain (near) duplicates ◦ Likely to be badly preprocessed ◦ Storage is an issue  One trillion data instances ◦ Beyond the reach of modern technology ◦ Peer-to-peer paradigm is (arguably) the only way to process the data ◦ Data privacy / inconsistency / skewness issues  Can’t be kept in one location  Is intrinsically hard to sample
  9. 9. Big Data Analysis Pipeline
  10. 10. Parallel Computing  Data-Instruction; ◦ SIMD, MIMD, …  Data intensive ◦ Cloud computing  Compute intensive ◦ GPU computing  Shared memory: OpenMP  Distributed memory: MPI,  Hybrid: MR
  11. 11. Cloud Computing  A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (Ethernet), usually for large Internet services;  Dynamic provision of services & resource pools in a coordinated fashion;  Cloud computing infrastructure is just a web service interface to operating system virtualization (via hypervisor);  Heterogeneous by virtualization;  Everything as a service (XaaS);  Data intensive: big data;  Distributed parallel, more like utility computing;  Not grid computing.
  12. 12. X-as-a-Service
  13. 13. Lambda Architecture  Equation “query = function(all data)” which is the basis of all data systems (data is more than information);  Human fault-tolerance – the system is not susceptible to data loss or data corruption caused by human error;  Data immutability – store data in its rawest form, immutable and in perpetuity;  Re-computation – with the two principles above it is always possible to (re)-compute results;  Layered structure: ◦ Batch layer: unrestrained batch compute, horizontally scalable, high latency, read-only database, raw dataset, overrides the speed layer (like Hadoop); ◦ Speed layer: only new data, stream processing, continuous compute, transactional, limited storage of windowed data (such as Storm); ◦ Serving layer: query batch views by load and random access.  Can discard any view, batch and real time, and just recreate everything from the master data.  Mistakes are corrected via recomputation. ◦ Write bad data? Remove the data & recompute. ◦ Bug in view generation? Just recompute the view.  Data storage is highly optimized.
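The “query = function(all data)” idea and the batch/speed split can be illustrated with a minimal Python sketch. The page-view events, the function names and the in-memory lists are all assumptions made for illustration; a real deployment would use something like Hadoop for the batch layer and Storm for the speed layer.

```python
from collections import Counter

def batch_view(master_data):
    """Batch layer: recompute the whole view from the immutable master dataset."""
    return Counter(event["page"] for event in master_data)

def speed_view(recent_events):
    """Speed layer: count only the events the batch layer has not absorbed yet."""
    return Counter(event["page"] for event in recent_events)

def query(page, batch, speed):
    """Serving side: merge the batch result with the real-time delta."""
    return batch[page] + speed[page]

# Hypothetical data: the master dataset is append-only, never updated in place.
master_data = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
recent_events = [{"page": "/home"}]

batch = batch_view(master_data)
speed = speed_view(recent_events)
home_views = query("/home", batch, speed)

# Recomputation: once the recent events are appended to the master data,
# every view can be thrown away and rebuilt from scratch.
master_data = master_data + recent_events
rebuilt = batch_view(master_data)
```

Because views are pure functions of the master data, fixing a bug in `batch_view` only requires running it again; no stored state has to be repaired by hand.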
  14. 14. Lambda Architecture Flowchart
  15. 15. Data Analytics System Architecture (figure; labels: online transaction processing, Facebook, Apache)
  16. 16. Fault Tolerance in the Data Stream System for High Availability
  17. 17. Interaction Model in the Stream Processing System (Push/Pull)
  18. 18. Map-Reduce  A programming model borrowed from functional programming  Separates the details of the original problem from the details of parallelism; ◦ map() produces one or more intermediate key/value pairs from the split input (“shards”); ◦ reduce() combines intermediate key/value pairs into final files after partitioning and sorting by key;  Scales from a single machine to a large cluster of machines;  Fault tolerance: failed Map or Reduce tasks are re-executed;  Locality: Distributed GFS chunks;  Bottleneck: The Reduce phase can’t start until the Map phase is completely finished (batch, not stream, processing): ◦ May not be suitable for real time processing and in-depth analysis.
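The map, partition/sort and reduce steps can be sketched on a single machine as a word count in plain Python. This is only an illustration of the programming model: the helper names are made up here, and nothing is distributed or fault tolerant.

```python
from collections import defaultdict

def map_fn(line):
    """map(): emit one intermediate (key, value) pair per word."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """reduce(): combine all values that share a key."""
    return key, sum(values)

def map_reduce(shards, map_fn, reduce_fn):
    # Map phase: every shard is processed independently.
    groups = defaultdict(list)
    for shard in shards:
        for line in shard:
            for key, value in map_fn(line):
                groups[key].append(value)  # shuffle: group values by key
    # Reduce phase: starts only after all map output has been grouped,
    # which is the batch bottleneck mentioned on the slide.
    return dict(reduce_fn(key, groups[key]) for key in sorted(groups))

shards = [["big data big"], ["data flow"]]
counts = map_reduce(shards, map_fn, reduce_fn)
```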
  19. 19. Map-Reduce Pipeline
  20. 20. Hadoop  HDFS: data storage and transfer, as GFS in Hadoop; ◦ NameNode (job tracker), DataNode (task tracker); ◦ Master Node, Slave Node; ◦ Error handling: replication (3 by default);  Job Tracker: scheduling, JobConf and JobClient;  Task Tracker: status, TaskRunner, map or reduce;  Data In/Out: ◦ HDFS block size in Input Splits ◦ # of reducers in Output;  Task Failure: report;  Job Scheduler: ◦ FIFO, Fair, Capacity,…
  21. 21. Map-Reduce in Hadoop (diagram)  A client computer submits a MapReduce job to the JobTracker on the master node;  Each slave node runs a TaskTracker, which manages task instances.
  22. 22. Hive & Pig  Hive: A database/warehouse on top of Hadoop; ◦ SQL as a familiar data warehousing tool ◦ Extensibility – Types, Functions, Formats, Scripts ◦ Scalability and Performance; ◦ Rich data types (structs, lists and maps); ◦ Efficient implementations of SQL filters, joins and group-by’s on top of MR; ◦ Easy interactions with different programming languages;  HQL is like SQL.  Pig: A platform for easier analysis of large data sets; ◦ Pig Latin: data flow language similar to scripting languages; ◦ Pig Engine: parses, optimizes and automatically executes Pig Latin scripts; ◦ User defined functions for column transforms (TOUPPER), or aggregation (SUM);  UDFs to take advantage of the combiner. ◦ Four join implementations built in: hash, fragment-replicate, merge, skewed; ◦ Writing load and store functions is easy once an I/O Format exists; ◦ Piggybank - a collection of user contributed UDFs; ◦ DataFu - LinkedIn's collection of Pig UDFs.
  23. 23. Apache Thrift  Software framework for scalable cross-language services, at Facebook;  A software stack + a code generation engine to build services between C++, Java, Python, PHP, Ruby, Erlang, Perl, C#, OCaml and Delphi, etc.;  Key components in this open source: ◦ Type: for users to develop using completely natively defined types; ◦ Transport: used by the generated code to facilitate data transfer; ◦ Protocol: a certain messaging structure in data transport, agnostic to encoding; ◦ Versioning: staged rollouts of changes to deployed services; ◦ RPC implementation: TProcessor instance for data stream processing to realize remote procedure calls (RPC), and TServer abstraction;  The interface definition language (IDL) allows for definition of Types: ◦ A Thrift IDL file is processed by the code generator to produce code for the target languages to support the defined structs and services in the IDL file.  Similar systems ◦ SOAP. Designed for web services via HTTP, excessive XML parsing overhead; ◦ CORBA. Relatively comprehensive, debatably overdesigned and heavyweight; ◦ Avro. Dynamic typing, untagged data, no manually-assigned field IDs; ◦ COM. Embraced mainly in Windows client software. Not entirely open solution; ◦ Pillar. Lightweight and high-performance, but missing versioning & abstraction; ◦ Protocol Buffers. Owned by Google; closed-source when Thrift was designed, open-sourced in 2008.
  24. 24. Apache Thrift
  25. 25. Apache Thrift The Thrift stack is a common class hierarchy implemented in each language that abstracts out the tricky details of protocol encoding and network communication
  26. 26. Chukwa  A data collection system for monitoring large distributed systems;  Provides flexible/powerful toolkit to display, monitor, and analyze results;  Architecture: ◦ Agents - run on each machine and emit data; ◦ Collectors - receive data from the agent and write it to stable storage; ◦ MapReduce jobs - parsing and archiving the data; ◦ Hadoop Infrastructure Care Center - a web-portal style interface.
  27. 27. ZooKeeper in Hadoop  A shared hierarchical name space of data registers;  Exposes common services in simple interface: ◦ Naming, configuration management, locks & synchronization, group membership services, leader election;  Each node in the namespace is called a ZNode. ◦ Persistent Nodes, Ephemeral Nodes, Sequence Nodes; ◦ Every ZNode has data and can optionally have children; ◦ Read requests are processed locally at the server; ◦ Write requests are forwarded to the leader.  ZNode paths: every ZNode exists at some path ◦ Canonical, absolute, slash-separated; ◦ No relative references; ◦ Names can have Unicode characters.  ZNode watches can be set on ZNodes ◦ One time change triggers, always ordered.  A client connects to ZooKeeper and initiates a session;  Consistency guarantees;  Supports Kerberos security.
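The ZNode concepts above (absolute paths, ephemeral and sequence nodes, one-time watches) can be mimicked with a toy in-memory tree. This is a sketch only: `ToyZNodeTree` and its methods are hypothetical and are not the ZooKeeper client API, there is no replication or leader, and the sequence counter is global here for brevity.

```python
class ToyZNodeTree:
    """Hypothetical single-process model of a ZNode namespace."""

    def __init__(self):
        self.nodes = {"/": b""}      # path -> data
        self.ephemeral_owner = {}    # path -> session id
        self.watches = {}            # path -> list of one-time callbacks
        self.seq = 0

    def create(self, path, data=b"", session=None, sequential=False):
        if not path.startswith("/"):
            raise ValueError("paths must be absolute")
        if sequential:
            # Sequence nodes get a monotonically increasing suffix.
            path = f"{path}{self.seq:010d}"
            self.seq += 1
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        self.nodes[path] = data
        if session is not None:      # ephemeral: tied to a session
            self.ephemeral_owner[path] = session
        return path

    def get(self, path, watcher=None):
        if watcher is not None:
            self.watches.setdefault(path, []).append(watcher)
        return self.nodes[path]

    def set(self, path, data):
        self.nodes[path] = data
        # Watches are one-time triggers: pop them before firing.
        for watcher in self.watches.pop(path, []):
            watcher(path)

    def close_session(self, session):
        # Ephemeral nodes disappear when their session ends.
        owned = [p for p, s in self.ephemeral_owner.items() if s == session]
        for path in owned:
            del self.nodes[path]
            del self.ephemeral_owner[path]

tree = ToyZNodeTree()
tree.create("/config", b"v1")
events = []
tree.get("/config", watcher=events.append)
tree.set("/config", b"v2")   # fires the watch
tree.set("/config", b"v3")   # no watch left to fire

tree.create("/workers")
first = tree.create("/workers/w-", session="s1", sequential=True)
second = tree.create("/workers/w-", session="s2", sequential=True)
tree.close_session("s1")     # removes the first worker's ephemeral node
```

Ephemeral sequence nodes under a common parent are the usual building block for group membership and leader election: the member holding the lowest surviving sequence number acts as leader.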
  28. 28. ZooKeeper Service  ZooKeeper service is replicated over a set of machines;  All machines store a copy of the data (in memory);  A leader is elected on service startup;  Each client connects to a single ZooKeeper server and maintains a TCP connection;  A client can read from any ZooKeeper server; writes go through the leader and need majority consensus.
  29. 29. Percolator  Describes how the web search index is kept up to date, at Google; ◦ Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines;  Incremental update to big data: code in Java, no need for batch process;  Provides transactions/locking, based on GFS, built on top of BigTable;  Architecture: ◦ Applications are a sequence of observers; ◦ An observer is called via notification; ◦ A notification is triggered when table data changes; ◦ Applications call BigTable’s TableServers via RPC; ◦ TableServers call GFS ChunkServer;
  30. 30. Percolator  Random accesses to the document repository while maintaining data invariants;  Faster than comparable MapReduce: improved latency (100x), reduces the documents’ average age by 50%;  Time stamping and locking via Chubby lock server;
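The observer/notification architecture can be sketched as a table that queues a notification whenever an observed column is written. The class, column names and the two observers below are hypothetical, and the sketch leaves out the transactions and locking that the slides describe.

```python
from collections import deque

class ToyObservedTable:
    """Hypothetical in-memory table with column observers (no transactions)."""

    def __init__(self):
        self.cells = {}          # (row, column) -> value
        self.observers = {}      # column -> list of callbacks
        self.pending = deque()   # queued notifications

    def observe(self, column, fn):
        self.observers.setdefault(column, []).append(fn)

    def write(self, row, column, value):
        self.cells[(row, column)] = value
        if column in self.observers:
            # A change to an observed column triggers a notification.
            self.pending.append((row, column))

    def run(self):
        # Observers may write other observed columns, chaining the work.
        while self.pending:
            row, column = self.pending.popleft()
            for fn in self.observers[column]:
                fn(self, row, self.cells[(row, column)])

def parse(table, row, raw_html):
    text = raw_html.replace("<p>", "").replace("</p>", "")
    table.write(row, "text", text)

def index(table, row, text):
    for word in text.split():
        table.write(word, "posting:" + row, True)

table = ToyObservedTable()
table.observe("raw", parse)
table.observe("text", index)
table.write("doc1", "raw", "<p>incremental index</p>")
table.run()
```

Only the rows touched by the new document are processed, which is the point of incremental updates compared with re-running a batch job over the whole repository.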
  31. 31. Caffeine  Caffeine is a new search scheme (algorithm) based on Percolator;  Even with changes, most white hat optimization tactics continue to prevail;  More competition for single, generic-type keywords, less stability of rankings, and increased focus on long-tail keywords in SEO;  Feature site titles and snippets with higher phrase/keyword density.  Faster index: results are returned at faster speeds.  Fresher results: more current, such as blog posts from the last few days.  More emphasis on social media, like Facebook, LinkedIn, Blogger, etc.  Less emphasis on universal search. ◦ Lower on the page to make paid search more visible.  Increased prominence of video. ◦ As prominently featuring video listings.  Keywords in domain name. ◦ Do weigh keyword domain names even higher. ◦ For a new site, a microsite with your keywords embedded within the URL.  “organize the world's information, make it universally accessible and useful”.
  32. 32. Panda/Farmer Update  SEO (Search Engine Optimization): ‘crawl’ the web (spider?); create this page Index and the Quality Team; the Spam Team throwing away stuff in the Index that shouldn’t be there (by the Crawl Team as well).  Google Mayday update: degrade lower-quality websites, place more weight on quality signals, lowering weight of textual relevancy signals. ◦ Anti-spam and user behavior.  Google Farmer (renamed as Panda) update: hurt “Content Farms”, or sites that contain huge amounts of poor-quality content in order to rank for as many keyword combinations as possible; ◦ Placing the emphasis on user experience (average time spent on the site/specific page, bounce rate, Click Through Rate, etc.) ◦ The social trend - “+1” buttons were added near each result; ◦ Personalized Search - The changes in results between users could arise from geographic differences or time of day; if the user is logged in to a Google account the results would be adjusted even further, since Google’s servers collect information about the user and their browsing habits.
  33. 33. Panda/Farmer Update
  34. 34. Dremel  Scalable, interactive ad-hoc query system for analysis of nested data; ◦ multi-level execution trees and SQL-like language to express ad hoc queries ◦ column-striped storage representation of nested data  BigQuery, interactive query service as external implement. of Dremel; ◦ Hive and Pig are slow  Data model is based on strongly-typed nested records  Tablet Storage and Horizontal Partitioning to save space  Levels are packed as bit sequence;  Queries based on their priorities and load balances with fault tolerance ◦ Slots and histograms ◦ Handles stragglers ◦ Tablets are three-way replicated  Interoperates with Google's data management tools ◦ In situ data access (e.g., GFS, Bigtable) ◦ MapReduce pipelines
  35. 35. Dremel schema records Repetition Level Definition Level
  36. 36. Apache Drill  Open source implementation of Google BigQuery  Flexibility: broader range of query languages  Fast ◦ Low latency queries ◦ Columnar execution: like Google Dremel ◦ Complement native interfaces and MapReduce/Hive/Pig  Open ◦ Community driven open source project ◦ Under Apache Software Foundation  Modern ◦ Standard ANSI SQL:2003 (select/into) ◦ Nested/hierarchical data support ◦ Schema is optional  Query any HBase, Cassandra or MongoDB table ◦ Supports RDBMS, Hadoop and NoSQL  DrQL: SQL-like query language  Mongo Query Language
  37. 37. YARN  Yet Another Resource Negotiator: MRv2 (Next Gen. Hadoop); ◦ Predictable Latency – A major customer concern; ◦ Support for alternate programming paradigms to MR.  Separate the tasks of Job Tracker ◦ Resource management ◦ Job Scheduling / Management  Resource Manager: Manages the global assignment of compute resources to applications; ◦ A pure scheduler (capacity/fair scheduler) and an Application Manager to accept job submissions for Application Master;  Node Manager: the per-machine framework agent for monitoring the resource usage, reporting to the Scheduler;  Application Master: manages the application’s life cycle (scheduling and coordination), a single job or a DAG of jobs.  Container: a process started by Node Manager to grant an application the privilege to use a certain amount of resources.
  38. 38. YARN Architecture (diagram)  Components: Client, Resource Manager, Node Managers, App Master, Containers;  Message flows: job submission, MapReduce status, node status, resource request.
  39. 39. Tez: Accelerating YARN Query Processing  Open source Apache incubator project and Apache licensed.  Distributed execution framework targeted towards data-processing applications.  Based on expressing a computation as a dataflow graph (DAG, directed acyclic graph) .  Built on top of YARN – the resource management framework for Hadoop.  Key design themes: ◦ Ability to express, model and execute data processing logic (vertex and edge) by dataflow definition API; ◦ Flexible Input-Processor-Output task model: data format, read/write, vertex task; ◦ Performance via Dynamic Graph Reconfiguration: pluggable vertex management modules; ◦ Performance via Optimal Resource Management: efficient acquisition of resources from YARN.  A Tez task is constituted of all the Inputs, Processor and all the Output(s).  Design of Tez includes support for pluggable vertex management to collect relevant info. from tasks and change the dataflow graph at runtime to optimize for performance and resource usage.  Re-using containers: not needing to allocate each container via the YARN ResourceManager (RM).  A Tez session, maps to one instance of a Tez Application Master (AM): container reuse and caching.
  40. 40. Tez: Accelerating YARN Query Processing
  41. 41. Storm: Distributed, Real-Time  Built by BackType, which was later acquired by Twitter; written in Clojure;  Tuples: ordered list of elements;  Streams: Unbounded sequence of tuples  Spout: Source of Stream ◦ E.g. Read from Twitter streaming API, event data,…  Bolts: Process input streams and produce new streams ◦ E.g. Functions, Filters, Aggregation, Joins,…  Topologies: a DAG of spouts and bolts;  Tasks: instances of Spouts and Bolts;  Stream grouping between spouts and bolts: 7 options ◦ All grouping, none grouping; ◦ Global grouping, local-or-shuffle grouping; ◦ Shuffle grouping, direct grouping; ◦ Fields grouping.  Guaranteed message processing;  Multilang support and transactional topologies;  Applied for stream processing, continuous computation, distributed RPC;  Trident: high-level abstraction on top of Storm, like Pig on Hadoop.
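A stream grouping decides which bolt task receives each tuple. The sketch below models four of the groupings as plain Python functions that return task indices. The function names and the CRC32 hash are assumptions chosen for a deterministic illustration; Storm's own implementation is not shown in the slides and differs in detail.

```python
import random
import zlib

def shuffle_grouping(tup, num_tasks, rng):
    """Shuffle: spread tuples randomly and evenly across tasks."""
    return rng.randrange(num_tasks)

def fields_grouping(tup, num_tasks, fields):
    """Fields: tuples with equal values in `fields` reach the same task."""
    key = "|".join(str(tup[f]) for f in fields).encode("utf-8")
    return zlib.crc32(key) % num_tasks   # illustrative hash, not Storm's

def all_grouping(tup, num_tasks):
    """All: replicate the tuple to every task."""
    return list(range(num_tasks))

def global_grouping(tup, num_tasks):
    """Global: the whole stream goes to one task (the lowest id)."""
    return 0

tuples = [{"word": "storm"}, {"word": "bolt"}, {"word": "storm"}]
rng = random.Random(0)
by_field = [fields_grouping(t, 4, ["word"]) for t in tuples]
by_shuffle = [shuffle_grouping(t, 4, rng) for t in tuples]
```

Fields grouping is what makes a per-word counter bolt correct: both `storm` tuples land on the same task, so each task can keep a local count without coordination.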
  42. 42. Storm: System Architecture  Nimbus: master node, like JobTracker  Supervisor: worker node that manages workers  ZooKeeper: stores metadata  UI: web-based  Topology elements: Spout, Bolt
  43. 43. Cascading  A data processing API and processing query planner used for data processing workflows on a single computing node or distributed cluster;  The processing model is based on a metaphor of pipes (data streams) and filters (data operations). Thus the API allows building pipe assemblies that split, merge, group, or join streams of data; ◦ A data record is a tuple; a simple chain of pipes without forks or merges is a branch; an interconnected set of pipe branches is a pipe assembly; and a series of tuples passing through a pipe branch/assembly is a tuple stream; ◦ Before being executed, a pipe assembly is bound to taps (data sources and sinks); the result of binding one or more pipe assemblies to taps is a flow; ◦ Multiple flows can be grouped together and executed as a single process. Such a collection of flows is called a cascade.  Pipe assemblies can examine, filter, organize, and transform the tuple data as the streams move through the pipe assemblies; ◦ Operation, Tuple, and Fields;  The Hadoop MapReduce Job Planner is an internal feature of Cascading;  A cascade executes a set of flows on a target cluster in dependency order.
  44. 44. Scalding  An extension to Cascading that enables application development with Scala, running on Hadoop, maintained by Twitter;  Comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to MapReduce jobs;  It provides functionality from custom join algorithms to multiple APIs (Fields-based, Type-safe, Matrix) to build robust data applications;  Two APIs: field based as primary API and Type Safe as secondary API;  Scalding Model: ◦ Source objects read/write data (from HDFS, DBs, MemCache, etc...); ◦ Pipes represent the flows of the data in the job, a distributed list;  It adds a Matrix API useful for graph and machine learning algorithms;
  45. 45. Summingbird  A library to write MR programs that look like native Scala collection transformations and run on distributed MR platforms: Storm (stream) and Scalding (batch, on top of Cascading);  Data: stream and snapshot;  Components in Summingbird: ◦ Producer: data stream abstraction for Platform to compile MR workflow ◦ Platform: implemented for any stream MR library; ◦ Source: stream of data ◦ Store: the “reduce” side of streaming MR; represents a snapshot of all key-value pairs; ◦ Sink: materializes an un-aggregated “stream” representation, not a snapshot; ◦ Service: performs a “lookup join” from a Store’s snapshot or a Sink’s stream; ◦ Plan: final representation of the MR flow produced by a Platform.  Related projects: ◦ Algebird is an abstract algebra library for Scala; ◦ Bijection’s Injection typeclass to share serialization between different platforms and clients; ◦ Chill augments Kryo with options, and provides integration with Storm, Scala and Hadoop; ◦ Storehaus’s async key-value store traits implement Summingbird’s client; ◦ Tormenta provides a layer over Storm’s Scheme and Spout interfaces.
  46. 46. S4: Simple Scalable Streaming System  Apache Incubator S4 by Yahoo, written in Java; ◦ Real-time/decentralized/scalable/event-driven/stream processing; ◦ Actors programming model (PEs); ◦ All in-memory, no disk bottlenecks; ◦ Pluggable event serving policies: load shedding, throttling, blocking; ◦ Failover, checkpointing , replication and recovery; ◦ Dynamic load balancing/adaptive load management.  Communication, scheduling & distribution across containers; ◦ S4 applications are built as a graph of:  Processing elements (PEs): event handling  Streams of events that interconnect PEs ◦ S4 processing nodes: distributed PE containers/hosts for PE; ◦ S4 clusters define named ensembles of S4 processing nodes; ◦ S4 events are dispatched to nodes according to their key. ◦ PEs communicate asynchronously by sending events on streams. ◦ Communication Layer: cluster management/failover (ZooKeeper); ◦ S4 adapters: applications to convert external streams into S4 events.  Adapters are also S4 applications, then scaled easily.
  47. 47. Hierarchical Perspective of S4
  48. 48. Data Freeway  A scalable data stream framework at Facebook  Scribe: Simple push/RPC-based logging system  Calligraphus: Call sync every 7 seconds ◦ RPC -> File System  Each log category is represented by 1 or more FS directories  Each directory is an ordered list of files ◦ Bucketing support  Application buckets are application-defined shards.  Infrastructure buckets allows log streams from x B/s to x GB/s  Continuous Copier: File System to File System ◦ Low latency and smooth network usage ◦ Deployment; Implemented as long-running map-only job, can move to any simple job scheduler ◦ Coordination: Use lock files on HDFS for now, move to Zookeeper soon  PTail: File System -> Stream ( -> RPC ) ◦ Checkpoints inserted into the data stream ◦ Can roll back to tail from any data checkpoints ◦ No data loss/duplicates
  49. 49. Data Freeway Architecture
  50. 50. Scribe  Scribe is a server for aggregating streaming log data, designed to scale to a very large number of nodes and robust to network/node failures;  Scribe servers are arranged in a directed graph, with each server knowing only about the next server in the graph;  Scribe is unique in that clients log entries consisting of two strings, a category and a message; ◦ The category is a high level description of the intended destination of the message and can have a specific configuration in the scribe server;  The server allows for configurations based on category prefix, and a default configuration that can insert the category name in the file path;  Flexibility and extensibility is provided through the “store” abstraction; ◦ Stores are loaded dynamically based on a configuration file, and can be changed at runtime without stopping the server; ◦ Stores are implemented as a class hierarchy, and stores can contain other stores;  Scribe is implemented as thrift service using non-blocking C++ server.
  51. 51. Apache Flume  Flume is a distributed service for collecting, aggregating, and moving large amounts of log data; ◦ A simple/flexible architecture based on streaming data flows; ◦ Robust/Fault tolerant with tunable reliability mechanisms and failover and recovery mechanisms; ◦ Use an extensible data model that allows for online analytic application;  Data flow model: ◦ An Event is a unit of data that flows through a Flume agent; a Flume agent is a process (JVM) that hosts the components; ◦ A Flume source stores an event into one or more channels (passive stores) until it’s consumed by a Flume sink; ◦ The sink puts the event into HDFS or forwards it to the Flume source of the next Flume agent (next hop) in the flow; ◦ The source and sink within the given agent run asynchronously with the events staged in the channel.  Set up: ◦ Flume agent configuration is stored in a local file (source, sink, channel); ◦ The agent knows what individual components to load and how they are connected to constitute the flow; ◦ Build multi-hop flows where events travel through multiple agents before reaching the final destination.
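The source, channel and sink roles, and the way one agent's sink feeds the next agent's source, can be sketched with in-memory queues. The `Channel` and `Agent` classes below are hypothetical stand-ins, not Flume's Java API or its configuration format, and they ignore reliability and transactions.

```python
from collections import deque

class Channel:
    """Passive store that stages events between a source and a sink."""

    def __init__(self):
        self.events = deque()

    def put(self, event):
        self.events.append(event)

    def take(self):
        return self.events.popleft() if self.events else None

class Agent:
    """Hypothetical agent: one source, one channel, one sink."""

    def __init__(self, sink):
        self.channel = Channel()
        self.sink = sink

    def source(self, event):
        # The source only stages the event; delivery happens later.
        self.channel.put(event)

    def drain(self):
        # The sink runs independently of the source and empties the channel.
        while True:
            event = self.channel.take()
            if event is None:
                break
            self.sink(event)

store = []                                   # stand-in for HDFS
last_hop = Agent(sink=store.append)          # final agent writes to storage
first_hop = Agent(sink=last_hop.source)      # sink forwards to the next hop

for line in ["GET /", "POST /login"]:
    first_hop.source({"body": line})

first_hop.drain()
staged = len(last_hop.channel.events)        # events wait in the next channel
last_hop.drain()
```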
  52. 52. Apache Flume
  53. 53. Puma: Real-Time MR  Real time Data Pipeline developed in Facebook, to be open source soon ◦ Utilize existing log aggregation pipeline (Scribe-HDFS) ◦ Extend low-latency capabilities of HDFS (Sync+PTail) ◦ High-throughput writes (HBase)  Support for real time reliable aggregation: Unique user count, most frequent elements ◦ Utilize HBase atomic increments to maintain roll-ups ◦ Complex HBase schemas for unique-user calculations ◦ Store checkpoint information directly in HBase  Multiple Group-By operations per log line  The first key in Group-By is always time/date-related  New 2 versions ◦ Puma 2  Simple ◦ Puma 3  Better performance  PQL – Puma Query Language Log Stream Aggregations Storage Serving
  54. 54. Puma2: Real-Time MR  Map phase with PTail ◦ PTail provide parallel data streams ◦ Divide the input log stream into N shards ◦ 1st version only supported random bucketing ◦ Now supports application-level bucketing  Reduce phase with HBase ◦ HBase does single increment on multiple columns ◦ Every row+column in HBase is an output key ◦ Aggregate key counts using atomic counters ◦ Also maintain per-key lists or other structures PTail Puma2 HBase Serving
  55. 55. Puma3: Real-Time MR  Puma3 is sharded (split) by aggregation key.  Each shard is a hash map in memory.  Each entry in the hash map is a pair of an aggregation key and a user-defined aggregation.  HBase as persistent key-value storage.  Workflows (diagram: PTail, Puma3, HBase, Serving): write, checkpoint, read, join.  Special aggregations ◦ Unique counts calculation  Adaptive sampling  Bloom filter (future) ◦ Most frequent item (future)  Lossy counting  Probabilistic lossy counting
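The Puma3 design (shard by aggregation key, keep each shard as an in-memory hash map, checkpoint to a persistent key-value store) can be sketched as follows. `ToyPuma3`, the CRC32-based sharding and the plain dict standing in for HBase are assumptions for illustration, and the only aggregation shown is a count.

```python
import zlib

class ToyPuma3:
    """Hypothetical sketch: in-memory shards plus a persistent store."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]
        self.store = {}   # stand-in for HBase

    def _shard(self, key):
        # Shard by aggregation key so each key lives in exactly one shard.
        digest = zlib.crc32(repr(key).encode("utf-8"))
        return self.shards[digest % len(self.shards)]

    def write(self, key, amount=1):
        shard = self._shard(key)
        shard[key] = shard.get(key, 0) + amount

    def checkpoint(self):
        # Persist the current aggregates so they survive a restart.
        for shard in self.shards:
            self.store.update(shard)

    def read(self, key):
        # Serve from memory first, fall back to the persistent copy.
        return self._shard(key).get(key, self.store.get(key, 0))

puma = ToyPuma3(num_shards=4)
log = [("2011-06-01", "/home"), ("2011-06-01", "/about"), ("2011-06-01", "/home")]
for date, url in log:
    puma.write((date, url))   # the first Group-By key is time/date-related

home = puma.read(("2011-06-01", "/home"))
puma.checkpoint()
```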
  56. 56. Amazon Kinesis  Kinesis scales for real-time processing of streaming big data;  Kinesis requires that a user create at least two applications, a “Producer” and a Kinesis application (also called a “Worker”), using Amazon’s Kinesis APIs;  The “Producer” takes data from some source and converts it into a "Kinesis Stream," a continuous flow of 50-kilobyte data chunks sent in the form of HTTP PUTs;  The "Worker" takes the data from the Kinesis Stream and does whatever processing is required;  The Kinesis application can run on any type of Amazon EC2 instance, and Kinesis will auto-scale the instances to handle varying streaming loads;  The Kinesis SDK libraries, used to create Kinesis Producers and applications, are only available for Java, but Kinesis applications can be written in any language by calling the Kinesis APIs directly;  Stream output is sent to Amazon’s S3, DynamoDB, or Redshift;  Kinesis can create DAGs of Kinesis applications and data streams.
  57. 57. Amazon Kinesis
  58. 58. MillWheel: FT Stream Processing  Framework for building low-latency data-processing applications;  Part of Google Cloud Dataflow that also has Google Pub-Sub and FlumeJava;  Similar functionality to Apache Storm;  Specifying a directed computation graph and application code for individual nodes, and it manages persistent state and the continuous flow of records, with fault-tolerance; ◦ Called “transformations computations”; ◦ Computation code includes contacting external systems, manipulating other MillWheel primitives, or outputting data, once getting the input data; ◦ I/O are represented by (key, value, timestamp) triples; ◦ Keys are the primary abstraction for aggregation and comparison between different records; ◦ Streams are the delivery mechanism between different computations; ◦ Persistent state is an opaque byte string that is managed on a per-key basis; ◦ The low watermark for a computation provides a bound on the timestamps of future records arriving at that computation; ◦ Timers are per-key programmatic hooks that trigger at a specific wall time or low watermark value. ◦ FT: delivery guarantees, state manipulation.
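The low watermark and per-key timers can be sketched in a few lines. The sketch assumes the recursive definition usually given for MillWheel (a computation's low watermark is the minimum of its own oldest pending timestamp and the low watermarks of the computations that feed it); the function names, graph and timestamps are hypothetical.

```python
def low_watermark(node, oldest_pending, upstream):
    """Bound on the timestamps of records that can still arrive at `node`.

    Assumed definition: min(own oldest pending work,
                            low watermarks of all upstream computations).
    """
    candidates = [
        low_watermark(src, oldest_pending, upstream)
        for src in upstream.get(node, [])
    ]
    if oldest_pending.get(node) is not None:
        candidates.append(oldest_pending[node])
    return min(candidates) if candidates else float("inf")

def due_timers(timers, watermark):
    """Per-key timers fire once the low watermark passes their timestamp."""
    return sorted(key for key, fire_at in timers.items() if fire_at <= watermark)

# Hypothetical pipeline: ingest -> parse -> count
upstream = {"count": ["parse"], "parse": ["ingest"]}
oldest_pending = {"ingest": 120, "parse": 105, "count": 130}

wm = low_watermark("count", oldest_pending, upstream)
fired = due_timers({"user-a": 100, "user-b": 110}, wm)
```

Here the slow `parse` stage holds the watermark of `count` back to 105, so the timer for `user-b` (set at 110) must wait even though `count` itself has no work older than 130.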
  59. 59. Apache Flink: Distributed Batch/Stream Processing  A distributed streaming dataflow engine written in Java and Scala;  Processes big data quickly, with low data latency and high fault tolerance, on large-scale distributed systems;  Previously known as Stratosphere, renamed Flink (Apache incubator).  Supports stateful stream processing; ◦ Asynchronous Barrier Snapshotting (ABS), a lightweight algorithm suited for modern dataflow execution engines that minimises space requirements; ◦ ABS persists only operator states on acyclic execution topologies while keeping a minimal record log on cyclic dataflows;  Maintains linear scalability and performs well with frequent snapshots; ◦ Stream barriers: injected into the data stream and flow with the records as part of the data stream; ◦ Operators that receive more than one input stream need to align the input streams on the snapshot barriers; ◦ Operators snapshot their state at the point in time when they have received all snapshot barriers from their input streams, before emitting the barriers to their output streams.
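Barrier alignment for an operator with several inputs can be sketched as below: once a barrier arrives on one input, later records from that input are buffered until barriers have arrived on all inputs; then the state is snapshotted, the barrier is forwarded, and the buffered records are processed. `AligningSum` is a hypothetical single-process model, not Flink's operator API, and the snapshot is kept in a list rather than durable storage.

```python
BARRIER = "BARRIER"

class AligningSum:
    """Hypothetical running-sum operator with barrier alignment."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.total = 0        # operator state
        self.blocked = set()  # inputs whose barrier has already arrived
        self.buffered = []    # records held back during alignment
        self.snapshots = []   # state captured at each barrier
        self.output = []      # records and barriers emitted downstream

    def receive(self, channel, item):
        if item == BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                # All barriers received: snapshot, then forward the barrier.
                self.snapshots.append(self.total)
                self.output.append(BARRIER)
                self.blocked.clear()
                pending, self.buffered = self.buffered, []
                for ch, record in pending:
                    self.receive(ch, record)
        elif channel in self.blocked:
            # Post-barrier record on an aligned input: hold it back.
            self.buffered.append((channel, item))
        else:
            self.total += item
            self.output.append(self.total)

op = AligningSum(num_inputs=2)
op.receive(0, 1)
op.receive(0, BARRIER)   # input 0 is now blocked
op.receive(0, 10)        # belongs to the next snapshot, so it is buffered
op.receive(1, 2)         # input 1 is still pre-barrier
op.receive(1, BARRIER)   # alignment complete
```

The snapshot contains exactly the effect of the pre-barrier records (1 and 2), which is what gives exactly-once state semantics on recovery.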
  60. 60. Apache Flink: Distributed Batch/Stream Processing
  61. 61. Apache Flink: Distributed Batch/Stream Processing
  62. 62. Apache Flink: Distributed Batch/Stream Processing
Property | Record acker (Storm) | Micro-batching (Spark Streaming, Trident) | Transactional updates (Google Cloud Dataflow) | Distributed snapshots (Flink)
Guarantee | At least once | Exactly once | Exactly once | Exactly once
Latency | Very low | High | Low (delay of transaction) | Very low
Throughput | Low | High | Medium to high (throughput of distributed transactional store) | High
Computation model | Streaming | Micro-batch | Streaming | Streaming
Overhead of FT mechanism | High | Low | Throughput of distributed transactional store | Low
Flow control | Problematic | Problematic | Natural | Natural
Separation of app. logic from FT | Partially (timeouts matter) | No (micro batch size) | Yes | Yes
  63. 63. Comparison of Puma, Storm, S4
  64. 64. NoSQL - Not Only SQL  Class of non-relational data storage systems (non-RDBMS)  Usually do not require a fixed table schema and do not use joins  NoSQL offerings relax one or more of the ACID properties ◦ Strong: ACID (Atomicity, Consistency, Isolation, Durability) ◦ Weak: BASE (Basically Available, Soft state, Eventual consistency)  Three major papers for NoSQL movement ◦ BigTable (Google) ◦ Dynamo (Amazon)  Gossip protocol (discovery and error detection)  Distributed key-value data store  Eventual consistency ◦ CAP Theorem: consistency, availability and partition tolerance  NoSQL solutions fall into two major areas ◦ Key/Value or ‘the big hash table’.  Amazon S3 (Dynamo) ◦ Schema-less in multiple flavors: column, document or graph-based.  Cassandra (column-based)  CouchDB (document-based)  Neo4J (graph-based)  HBase (column-based)
  65. 65. BigTable at Google  A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map  The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.  Rows with consecutive keys are grouped together as “tablets”.  Column keys are grouped into sets called “column families”, which form the unit of access control.  A column key is named using the following syntax: family:qualifier.  Bigtable uses the distributed Google File System (GFS) to store log and data files.  The Google SSTable file format is used internally to store Bigtable data. ◦ An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings.  Bigtable relies on a persistent distributed lock service ◦ Chubby (a name space).
  66. 66. Dynamo: Key-value Store  Distributed, highly available storage system from Amazon; ◦ SLA: the application can deliver its functionality in a bounded time.  Simple interface associated with a key: get(key) and put(key, data) ◦ Binary objects (data < 1 MB) identified by a unique key.  Partitioning for incremental scalability ◦ Consistent hashing: the output range of a hash function is treated as a “ring”. ◦ “Virtual node”: each physical node can be responsible for more than one virtual node.  Replication for high availability and durability ◦ “Preference list”: the list of nodes that are responsible for storing a particular key.  Data versioning: vector clocks to capture causality between versions; ◦ A vector clock is a list of (node, counter) pairs.  Execution of get()/put(): client, coordinator  Handling failures: “sloppy quorum” and hinted handoff  Replica synchronization: anti-entropy protocol using Merkle (hash) trees  Membership/Failure Detection ◦ Gossip-based protocol  Implementation: Java and APIs over HTTP ◦ BDB or MySQL;  Note: Amazon S3 service powered by Dynamo.
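Consistent hashing with virtual nodes and a preference list can be sketched as follows. This is an illustration under stated assumptions rather than Dynamo's implementation: the `Ring` class, the choice of MD5 for ring positions, and the default of 8 virtual nodes per physical node are all hypothetical.

```python
import bisect
import hashlib

def _hash(value):
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node owns several virtual nodes spread over the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def preference_list(self, key, n=3):
        """First n distinct physical nodes met walking clockwise from the key."""
        start = bisect.bisect_right(self._points, _hash(key))
        result = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result
```

Skipping repeated physical nodes while walking the ring is what keeps the replicas of one key on different machines even though each machine appears many times as a virtual node.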
  67. 67. Cassandra  A Decentralized Structured Storage System;  Design goals: ◦ High availability, eventual consistency, incremental scalability, optimistic replication, “knobs” to tune tradeoffs between consistency, durability and latency, low total cost of ownership, and minimal administration;  Architecture:  Nodes communicate with each other through the Gossip protocol, which exchanges information across the cluster every second;  A commit log on each node captures write activity, which assures data durability;  Data is also written to an in-memory structure (memtable) and then to disk (as an SSTable) once the memory structure is full;  It is a row-oriented, column structure;  A key space is akin to a database in the RDBMS world;  A column family is similar to an RDBMS table but is more flexible/dynamic;  A row in a column family is indexed by its key; other columns may be indexed  Cassandra ~= Bigtable + Dynamo
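The write path described above (commit log, memtable, flush to an SSTable) can be modelled in memory. This is a teaching sketch only: the `Node` class, the `memtable_limit` parameter, and the tuple-based "SSTable" are hypothetical and say nothing about Cassandra's on-disk format.

```python
class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write (durability)
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # flushed tables: sorted, immutable tuples of pairs
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, then apply in memory
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Persist the memtable as a sorted immutable table and start a new one.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest flushed table wins
            for k, v in table:
                if k == key:
                    return v
        return None
```

Reads consult the memtable before the flushed tables, newest first, so a later write to the same key shadows older flushed values.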
  68. 68. HBase  HBase is an open-source, distributed, column-oriented database built on top of HDFS based on Google BigTable; ◦ Part of the Hadoop ecosystem (written in Java) ◦ Native connections to Map-Reduce  HBase by default manages a ZooKeeper instance as the authority on cluster state;  Structures data as tables of column-oriented rows ◦ Large, variable, number of columns per row ◦ Rows stored in sorted order ◦ Region: contiguous set of sorted rows, made of Stores ◦ Table: split roughly into equal sized regions ◦ RegionServer: keeps log of every update, manage region split ◦ Master: assigns Table Regions to RegionServers ◦ MemStore: Holds in-memory modifications to the Store.  HBase is not fully ACID-compliant  Can random read and write (no built-in joins) ◦ Single row operations (put, get, scan) ◦ Multiple row operations (scan, multiPut)
  69. 69. HBase Architecture
  70. 70. NewSQL: The new way to handle big data  A set of various new scalable/high-performance SQL database vendors (or databases); ◦ Migrate existing applications to adapt to new trends of data growth; ◦ Develop new applications on highly scalable OLTP systems; ◦ Rely on existing knowledge of OLTP (online transaction processing) usage.  Technical characteristics; ◦ Scalable performance of NoSQL systems for OLTP read-write workloads; ◦ ACID (Atomicity, Consistency, Isolation, Durability) guarantees, support for transactions; ◦ SQL as the primary mechanism for application interaction. ◦ A non-locking concurrency control mechanism, so real-time reads do not conflict with writes and do not cause them to stall. ◦ An architecture providing much higher per-node performance than available from traditional RDBMS solutions. ◦ A scale-out, shared-nothing architecture, capable of running on a large number of nodes without bottlenecks.  NewSQL categorization; ◦ New databases: improve performance by making memory (non-disk) or new kinds of disks (flash/SSD) the primary data store. ◦ New MySQL storage engines: Xeround, Akiban, MySQL NDB cluster, GenieDB, Tokutek, etc. ◦ Transparent clustering: provide transparent clustering and sharding to improve scalability.  Google Spanner is an example of NewSQL DB.
  71. 71. Spanner: Google’s Globally-Distributed Database  Scalable, multi-version, globally distributed, and synchronously-replicated database;  Scales up to millions of machines across hundreds of datacenters and trillions of database rows;  Reshards data across machines as the amount of data or the number of servers changes, and automatically migrates data across machines (or datacenters) to balance load and in response to failures; ◦ A Spanner deployment is called a universe; Spanner is organized as a set of zones; ◦ A zone has one zonemaster and between one hundred and several thousand spanservers; ◦ Directory (a bucketing abstraction): a set of contiguous keys that share a common prefix; ◦ A directory is the unit of data placement, also the smallest unit with geographic replication properties;  A data model based on schematized semi-relational tables, general-purpose transactions, externally consistent reads/writes and SQL-based query language;  Replication for global availability and geographic locality; clients fail over between replicas; ◦ A tablet is similar to Bigtable’s tablet abstraction, with a single Paxos state machine on top of each tablet; ◦ Each spanserver has a lock table to implement concurrency control;  Data is stored in schematized semi-relational tables, versioned, time-stamped; ◦ TrueTime API, implemented by a set of time master machines per datacenter and a timeslave daemon per machine.
  72. 72. Spanner: Google’s Globally- Distributed Database Spanner server organization Spanserver software stack TrueTime architecture
  73. 73. Kafka  A distributed publish-subscribe messaging system;  Maintains feeds of messages in categories called topics; a topic is a category or feed name to which messages are published;  Each partition of a topic is an ordered, immutable sequence of messages that is continually appended to a commit log; ◦ Each partition has one server which acts as the "leader" and zero or more servers which act as "followers";  Messaging traditionally has two models: queuing and publish-subscribe; ◦ Producer: a process that publishes messages to a Kafka topic; ◦ Consumer: subscribes to topics and consumes the published messages by pulling;  A Kafka cluster consists of one or more servers, each called a broker, which store the published messages.  Efficiency on a single partition ◦ Very simple storage: log == list of files, message addressed by a log offset; ◦ Efficient data transfer: no message caching, zero-copy transfer, FS buffering; ◦ Stateless broker: consumer maintains its own state, SLA-based retention.  Distributed coordination: auto load balancing ◦ Make a partition within a topic the smallest unit of parallelism; ◦ No central “master” node; ZooKeeper provides the consensus service;  Delivery guarantees: messages within a partition are delivered to a consumer in order. ◦ Built-in replication to store each message on multiple brokers.
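The partitioned log, offset addressing, and pull-based consumption can be shown with an in-memory model. This is not the Kafka client API: `Partition`, `Topic`, `Consumer`, and the CRC32 key-to-partition rule are hypothetical stand-ins chosen for the example.

```python
import zlib

class Partition:
    """Append-only log; a message is addressed by its offset."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the appended message

    def read(self, offset, max_messages=10):
        return self.log[offset:offset + max_messages]

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def publish(self, key, message):
        # Same key -> same partition, so per-key ordering is preserved.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return index, self.partitions[index].append(message)

class Consumer:
    """The consumer, not the broker, remembers how far it has read."""

    def __init__(self, partition):
        self.partition = partition
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.partition.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch
```

Because the broker side keeps no per-consumer state, a consumer can rewind simply by resetting its own `offset`.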
  74. 74. Kafka Cluster
  75. 75. Samza  Samza is a stream processing framework on top of Hadoop YARN (MRv2) ◦ Simple API: a very simple call-back based "process message" API; ◦ Managed state: snapshotting and restoration; ◦ Fault tolerance: works with YARN to migrate your tasks; ◦ Durability: uses Kafka to guarantee no messages get lost; ◦ Scalability: partitioned and distributed at every level; ◦ Pluggable: provides a pluggable API to run Samza with other messaging systems; ◦ Processor isolation: works with YARN to give security and resource scheduling.  Concepts in Samza: ◦ Streams: immutable messages of a similar type or category; ◦ Jobs: code that performs a logical transformation on a set of input streams; ◦ Partitions: each partition in the stream is a totally ordered sequence of messages; ◦ Tasks: the unit of parallelism of the job; ◦ Dataflow graphs: nodes are streams containing data, edges are jobs performing transformations; ◦ Containers: unit of physical parallelism that runs one or more tasks.  Samza architecture: stream processing built with Kafka and YARN ◦ Streaming: Kafka ◦ Execution: YARN ◦ Processing: Samza API ◦ Uses YARN and Kafka to provide stage-wise stream processing/partitioning.
  76. 76. Samza Concepts and Architecture
  77. 77. Machine Learning  “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” ◦ Supervised model: labeled data; ◦ Unsupervised model: unlabeled data; ◦ Semi-supervised model: both labeled and unlabeled data; ◦ Reinforcement Learning: learn by interacting with an environment.  Types of ML algorithms ◦ Prediction: predicting a variable from data ◦ Classification: assigning records to predefined groups ◦ Clustering: splitting records into groups based on similarity ◦ Association learning: seeing what often appears together with what  Relationship with others ◦ Artificial intelligence: emulate how the brain works with programming;  ML is a branch of AI ◦ Data mining: building models in order to detect the patterns; ◦ Statistical analysis: probabilistic models, on which to infer using data; ◦ Information retrieval: retrieval of information from a collection of data (doc).
  78. 78. Some Issues in ML  Training/testing data split (e.g. 70%/30%)  Data unbalanced (one class has more data than the others) ◦ Sampling, learning algorithm modification (cost-sensitive), ensemble,…  “Open set” (how to handle unknown or unfamiliar classes);  Feature extraction ◦ Sparse coding, vector quantization,…  Curse of Dimensionality: Sensitivity to “noise” ◦ Dimension reduction, manifold learning/distance metric learning  Linear or non-linear model ◦ Local/Global minimum (convex/non-convex obj. function): Learning rate ◦ Regularization: L-1/L-2 norm ◦ Kernel trick: mapping nonlinear feature space to high dim. linear space  Discriminative or generative model ◦ Bottom up (conditional distribution) /Top down (joint distribution)  Over-fitting: Learn the “noise” ◦ Cross validation with grid search  Vanishing gradient and sensitivity of initialization  Performance evaluation ◦ Precision/Recall, confusion matrix, ROC (receiver operating characteristic)
  79. 79. “Data Unbalancing” Issue in ML  Resampling methods for balancing the data set.  Over-sampling, under-sampling, importance sampling;  Modification of existing learning algorithms.  Cost-sensitive learning;  One class classification;  Classifier ensemble (bagging, boosting, random forest…)  Measuring the classifier performance in imbalanced domains.  ROC, F-measure,…  Relationship between class imbalance and other data complexity characteristics.
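Why accuracy alone misleads on imbalanced data is easy to show with the confusion-matrix counts. The helper names below (`confusion_counts`, `precision_recall_f1`) are hypothetical; the formulas are the standard definitions of precision, recall, and F1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive=1):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# 5 positives among 100 samples; a classifier that always predicts "negative"
# is 95% accurate but finds none of the positives.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here `accuracy` is 0.95 while recall and F1 are both zero, which is the reason imbalanced problems are usually scored with recall, F-measure, or ROC rather than accuracy.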
  80. 80. “Open Set” Issue in ML  How to handle unknown or unfamiliar classes?  Label as one of known classes or as unknown;  Zero shot learning/unseen class detection;  Novelty detection with null space methods;  One class SVM;  Multiple classes:  Artificial super class from all given classes;  Combine several one class classifiers learned separately;  K-nearest neighbors;
  81. 81. “Curse of Dimensionality” in ML  Curse of dimensionality: distributing bins or basis functions uniformly in the input space may work in 1 dimension, but will become exponentially useless in higher dimensions;  Learning a "state-of-nature" from a number of samples in a high-dimensional feature space with each feature having a number of possible values, enormous amount of training data are required to ensure that there are several samples with each comb. of values;  With a fixed number of training samples, the predictive power reduces as the dimensionality increases, and this is known as the Hughes effect or Hughes phenomenon;  How to avoid it?  Dimension reduction: PCA, LDA, MDS;
  82. 82. “Over-fitting” Issue in ML  A statistical model describes ”noise” instead of the underlying relationship;  Over-fitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations;  A model which has been over-fit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data;  How to avoid over-fitting?  Explicitly penalize overly complex models;  Test the model's ability to generalize by evaluating its performance on a set of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter;  Methods: cross-validation, regularization, early stopping, pruning, Bayesian priors on parameters or model comparison;
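Evaluating on data not used for training is what cross-validation automates. A minimal k-fold sketch follows; `k_fold_indices` and `cross_validate` are hypothetical helpers, and the caller supplies the `fit` and `score` functions.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_validate(xs, ys, fit, score, k=5):
    """Average held-out score: fit on k-1 folds, score on the remaining fold."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)
```

Each sample is used for testing exactly once, so the averaged score estimates performance on unseen data; a model that only memorised the training folds scores poorly here. In practice the data should be shuffled before contiguous folds are cut.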
  83. 83. Large Scale Machine Learning  Data independent model ◦ Assumes each data instance can be independently computed, such as Hadoop  Data locally dependency model ◦ Assumes many data vertices locally connected with its neighbor vertices, each data vertex updates its own status in parallel according to the status of its connected neighbors vertices, like GraphLab.  Deep learning: Learn multi-layers of data represent. (features) ◦ MLP, CNN, DBN, DBM, SDAE,…; ◦ Unsupervised pre-training + supervised fine tuning; ◦ Dropout, data augmentation for avoiding overfitting;  Online ML: fast and memory‐efficient ◦ Stochastic/incremental gradient descent (SGD); ◦ Adaptive learning rate and momentum;  Ensemble learning: easy to be distributed, scalable ◦ Boosting, bagging, stacking, random forest,…  Open Sources: Mahout, R, WEKA, MLPack, MLBase,…  ML on parallel machines ◦ GPU, cloud or cluster (distributed), multi-core,…
  84. 84. Deep Learning  Representation learning attempts to automatically learn good features or representations;  Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high level features);  Become effective via unsupervised pre-training + supervised fine tuning; ◦ Deep networks trained with back propagation (without unsupervised pre-training) perform worse than shallow networks.  Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);  Semi-supervised: structure of manifold assumption; ◦ labeled data is scarce and unlabeled data is abundant.
  85. 85. Why Deep Learning?  Supervised training of deep models (e.g. many-layered Nets) is too hard (optimization problem); ◦ Learn prior from unlabeled data;  Shallow models are not suited to learning high-level abstractions; ◦ Ensembles or forests do not learn features first; ◦ Graphical models could be deep nets, but mostly are not.  Unsupervised learning could be “local-learning”; ◦ Resembles boosting, with each layer being like a weak learner  Learning is weak in directed graphical models with many hidden variables; ◦ Sparsity and regularizer.  Traditional unsupervised learning methods do not easily learn multiple levels of representation. ◦ Layer-wise unsupervised learning is the solution.  Multi-task learning (transfer learning and self-taught learning);  Other issues: scalability & parallelism with the burden from big data.
  86. 86. Trade-off in Large Scale ML  Small scale vs Large scale ◦ We have a small-scale learning problem when the active budget constraint is the number of examples 𝑛. ◦ We have a large-scale learning problem when the active budget constraint is the computing time 𝑇.  Statistical Perspective ◦ It is good to optimize an objective function that ensures a fast estimation rate when the number of examples increases.  Optimization Perspective ◦ To efficiently solve large problems, it is preferable to choose an optimization algorithm with strong convergence properties.  Incorrect Conclusion ◦ To address large-scale learning problems, use the best algorithm to optimize an objective function with fast estimation rates  Learning with approximate optimization ◦ Stochastic gradient descent (historically associated with BP)
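Stochastic gradient descent trades exact optimization for cheap per-example updates, which is the point of the last bullet. Below is a minimal sketch for one-dimensional least-squares regression; the function name, learning rate, and epoch count are illustrative assumptions, not values from the slides.

```python
import random

def sgd_linear_regression(data, lr=0.1, epochs=500, seed=0):
    """Fit y ~ w * x + b by SGD on the squared error, one example per update."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)            # visit the examples in random order
        for x, y in data:
            err = (w * x + b) - y    # gradient of 0.5 * err**2 w.r.t. the prediction
            w -= lr * err * x
            b -= lr * err
    return w, b

# Noise-free samples of y = 2x + 1 on [0, 1].
samples = [(x / 4, 2 * (x / 4) + 1) for x in range(5)]
w, b = sgd_linear_regression(samples)
```

Each update touches a single example, so the cost per step is independent of the dataset size; that is what makes the method attractive when computing time, not the number of examples, is the binding constraint.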
  87. 87. Some Issues in Large Scale ML  Job scheduling ◦ Schedule and monitor “batch” jobs;  Parallel execution ◦ Distributed ◦ SIMD;  Auto Scaling ◦ Scale up (vertical) ◦ Scale out (horizontal)  Monitoring  Fault-tolerant ◦ Failover ◦ recover  Load balancing ◦ Distribute work load across the cluster  Work flow management ◦ Choreography ◦ Orchestration
  88. 88. Job Scheduling  A job scheduler is a program that enables an enterprise to schedule and, in some cases, monitor computer "batch" jobs (units of work, such as the running of a payroll program).  A job scheduler can initiate and manage jobs automatically by processing prepared job control language statements or through equivalent interaction with a human operator.  Functions:  Avoid starvation;  Maximize throughput;  Minimize response time;  Optimal use of resources.  Hadoop: FIFO, FAIR (Facebook), Capacity, Dynamic Priority Schedulers.
  89. 89. Auto-Scaling  Auto-scaling: scales up/down when the load increases/decreases; the ability to handle an increasing amount of work gracefully;  Vertical scalability:  Scaling Up: maintain performance levels as the number of concurrent requests increases;  Horizontal scalability:  Scaling Out: meet demand through replication across a pool of servers;  Dimensions  Load. Handling increasing load by adding resources;  Geographic. Maintain performance in geographically distributed systems;  Functional. Adding new features with minimum effort.  Amazon CloudWatch: EC2 (CPU, Disk/Network I/O), ELB.
  90. 90. Load Balancing  Load balancing distributes workload across one or more servers, network interfaces, hard drives, or other computing resources;  A load balancer provides the means by which instances of applications can be provisioned and de-provisioned automatically, without requiring changes to the network or its configuration;  Determine the maximum connection rate that the various solutions are capable of supporting;  Failover: continuation of the service after the failure;  Amazon Elastic Load Balancer (ELB): it facilitates distributing incoming traffic among multiple AWS instances (like HAProxy);  Span Availability Zones (AZ), and can distribute traffic to different AZs;
  91. 91. Workflow Management  A workflow is a loosely-coupled parallel application that consists of a set of computational tasks linked via data/control-flow dependencies;  It defines how tasks are structured, who performs them, what their relative order is, how they are synchronized, how information flows to support the tasks and how tasks are tracked.  An activity is a discrete step in a business process (workflow); activities are orchestrated together in a workflow;  “Service choreography”: description of coordination between two or more parties.  “Service orchestration”: a business process is modeled using workflows.  Amazon Simple Workflow (SWF): task coordination and state management for cloud apps;  LinkedIn Azkaban: A workflow scheduler that allows the independent pieces to be declaratively assembled into a single workflow.
  92. 92. Apache Spark  Spark originally written in Scala, added with Java and Python API;  Resilient distributed dataset (RDD): Collection of elements partitioned across nodes of cluster operated in parallel; ◦ Parallelized collections with partitions, take an existing Scala collection and run functions on it in parallel; ◦ Stores distributed datasets in HDFS from any file, run functions on each record of a file in HDFS or supported by Hadoop; ◦ Two types of operations:  Transformations: create a new dataset from an existing one,  Action: return a value to the driver program after comput. on the dataset.  Shared Variables (two types): used in parallel operations ◦ broadcast variables, cache a value in memory on all nodes; ◦ accumulators, only “added” to, such as counters and sums.
  93. 93. Apache Spark  Shuffle: transfer data around the cluster to new partitions; ◦ Data partition, serialization/deserialization, compression and disk IO.  Spark Scheduler: turns a DAG into Stages, each a series of Tasks. ◦ DAGScheduler: divides the received job into different stages and the corresponding tasks, and then submits the tasks to the TaskScheduler; the relationship between different stages forms a DAG; ◦ TaskScheduler: accepts and executes tasks. It has three sub-modules: ClusterScheduler, LocalScheduler, and MesosScheduler.  Fault tolerance: ◦ RDDs track the series of transformations used to build them (their lineage) to recompute lost data  How to tune the Spark jobs? ◦ Tune the CPU and memory; ◦ Tune the data structure: record and Kryo serializer; ◦ Tune the parallelism: number of partitions.
  94. 94. Apache Spark: Transformations/Actions
Transformation/Action | Description
map(function f1) | Pass each element of the RDD through f1 in parallel and return the resulting RDD.
filter(function f2) | Select elements of the RDD that return true when passed through f2.
flatMap(function f3) | Similar to map, but f3 returns a sequence to facilitate mapping a single input to multiple outputs.
union(RDD r1) | Returns the union of the RDD r1 with the self RDD.
sample(flag, p, seed) | Returns a randomly sampled (with seed) fraction p of the RDD.
groupByKey(noTasks) | Can only be invoked on key-value paired data; returns data grouped by key. No. of parallel tasks is given as an argument (default is 8).
reduceByKey(function f4, noTasks) | Aggregates the result of applying f4 on elements with the same key. No. of parallel tasks is the second argument.
join(RDD r2, noTasks) | Joins RDD r2 with self; computes all possible pairs for a given key.
groupWith(RDD r3, noTasks) | Joins RDD r3 with self and groups by key.
sortByKey(flag) | Sorts the self RDD in ascending or descending order based on flag.
reduce(function f5) | Aggregates the result of applying function f5 on all elements of the self RDD.
collect() | Return all elements of the RDD as an array.
count() | Count no. of elements in the RDD.
take(n) | Get the first n elements of the RDD.
first() | Equivalent to take(1).
saveAsTextFile(path) | Persists the RDD in a file in HDFS or another Hadoop-supported file system at the given path.
saveAsSequenceFile(path) | Persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs that implement the Hadoop Writable interface or equivalent.
foreach(function f6) | Run f6 in parallel on elements of the self RDD.
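The semantics of a few of these operations can be illustrated without a cluster. The functions below are plain-Python stand-ins on lists, not the Spark API: `flat_map` and `reduce_by_key` are hypothetical helpers that mimic what flatMap and reduceByKey compute, here chained into a word count.

```python
from collections import defaultdict
from functools import reduce

def flat_map(f, records):
    """Like flatMap: f returns a sequence per record; the results are concatenated."""
    return [out for record in records for out in f(record)]

def reduce_by_key(f, pairs):
    """Like reduceByKey: combine all values that share a key using f."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(f, values) for key, values in groups.items()}

lines = ["to be or", "not to be"]
words = flat_map(str.split, lines)                     # one line -> many words
pairs = [(word, 1) for word in words]                  # map step
counts = reduce_by_key(lambda a, b: a + b, pairs)      # aggregate per key
```

In Spark the first two steps would be lazy transformations and only an action such as collect() or count() would trigger execution; the list-based version above evaluates eagerly and only shows the data flow.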
  95. 95. Apache Spark  Spark Streaming: large scale stream processing framework.  Spark persists (caches) a dataset in memory across operations  Spark can run on Amazon Elastic MapReduce;  MLlib: implementation of some Machine Learning functionality, as well as associated tests and data generators; ◦ Binary classification: SVM, Naïve Bayes and Logistic Regression; ◦ Linear regression: SGD; ◦ Clustering: k-means++, SVD; ◦ Collaborative filtering for recommender system: Alternating LS ◦ Gradient Descent and Stochastic GD  Bagel: implementation of Google’s Pregel; ◦ GraphX: unified graph analytics, to replace Bagel; ◦ GAS (Gather/Apply/Scatter): splits vertex programs into 3 data-parallel stages;  Shark: Port of Apache Hive to run on Spark.
  96. 96. Data Flow in Spark
  97. 97. Spark Streaming  Spark Streaming: Run a streaming computation as a series of very small, deterministic batch jobs; ◦ Scales to hundreds of nodes ◦ Achieves low latency ◦ Efficiently recovers from failures ◦ Integrates with batch and interactive processing  Programming model: DStream (Discretized Stream) ◦ Represents a stream of data ◦ Implemented as a sequence of RDDs  Languages: Scala API, Java API, Python API  DStream + RDDs = POWER; ◦ Combine live data streams with historical data; ◦ Combine streaming with MLlib, GraphX algorithms; ◦ Query streaming data using SQL;  FT: input data is replicated in memory; lost state is recomputed from the replicated data.
  98. 98. Spark Runtime and Spark Streaming
  99. 99. Mahout - Scalable ML • Apache Software Foundation Java library; • Scalable “machine learning“ and “data mining“ library that runs on Apache Hadoop mostly using map/reduce (M/R) paradigm; • Currently Mahout supports 3”C”s+Extras use cases: • Collaborative Filtering for recommendation: • Non-distributed: Taste • Distributed: user-based or item-based on Hadoop • Collaborative filter with matrix factorization • Classification: Perceptron, RBM, Winnow, Logistic regression, Naïve Bayes, Complementary NB, HMM, boosting, random forest, SVM, … • Clustering: Canopy, k-means, mean shift, Dirichlet process, spectral, min-hash, hierarchical, LDA (Latent Dirichlet Allocation), EM,… • Frequent Pattern Mining: Parallel FP-Growth (PFP); • Locally Weighted Linear Regression; • SVD (singular value decomposition), PCA, ICA,... • Evolutionary algorithm: genetic algorithm, .... • Mahout can produce vector representations from a Lucene (Solr) index; • Run Mahout on Amazon EC2 and Amazon EMR.
  100. 100. Mahout ML Algorithms [Architecture diagram; blocks: Applications, Examples; Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic; Utilities (Lucene/Vectorizer); Math (Vectors/Matrices/SVD); Collections (primitives); Apache Hadoop]
  101. 101. Jubatus: real-time ML  Real-time and highly-scalable ML platform from NTT (Japan)  Online learning algorithms in Jubatus ◦ Linear classification  Perceptron/Passive Aggressive/Confidence Weighted Learning/Soft CWL/AROW/Normal HERD ◦ Regression: PA-based regression ◦ Nearest neighbor: LSH/Min-Hash/Euclid LSH ◦ Recommendation: Based on nearest neighbor ◦ Anomaly detection: LOF based on nearest neighbor ◦ Graph analysis: Shortest path/Centrality (PageRank) ◦ Simple statistics  Why Jubatus? ◦ Online learning requires frequent model updates ◦ Naïve distributed architecture leads to too many synchronization operations  Solution in Jubatus: Loose model sharing  Basic operations in Jubatus: work only locally to stay real-time ◦ UPDATE: the local model is updated by each input sample, never shared! ◦ ANALYZE: each sample goes to a randomly chosen server and the result goes back to the client; ◦ MIX: servers send out model differences, which are merged and redistributed.  Everything in memory (process the data on the fly)
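The UPDATE/ANALYZE/MIX cycle can be sketched with a perceptron per server. This is an illustration, not Jubatus code: the `Server` class and `mix` function are hypothetical, and merging the model differences by simple averaging is an assumption made for the example.

```python
class Server:
    def __init__(self, dim):
        self.base = [0.0] * dim   # model as of the last MIX
        self.w = [0.0] * dim      # local model, updated without any communication

    def _score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        """UPDATE: perceptron step on the local model only (y is -1 or +1)."""
        if y * self._score(x) <= 0:
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]

    def analyze(self, x):
        """ANALYZE: answer from the local model, no synchronization needed."""
        return 1 if self._score(x) > 0 else -1

    def diff(self):
        return [wi - bi for wi, bi in zip(self.w, self.base)]

    def apply_mix(self, merged):
        self.base = [bi + mi for bi, mi in zip(self.base, merged)]
        self.w = list(self.base)

def mix(servers):
    """MIX: average the model differences and redistribute the merged result."""
    diffs = [server.diff() for server in servers]
    merged = [sum(column) / len(servers) for column in zip(*diffs)]
    for server in servers:
        server.apply_mix(merged)
```

Between MIX rounds the servers never wait for each other, which is what keeps UPDATE and ANALYZE fast; the cost is that local models temporarily disagree.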
  102. 102. Jubatus [Diagram: batch (stored) vs. real-time (online) processing of big data, and simple analysis (statistics) vs. in-depth analysis (classification, estimation, prediction); batch application vs. real-time application]
  103. 103. GraphLab in C++  A graph-based, high performance, distributed computation framework for parallel Machine Learning, from Select Lab, CMU;  A unified multicore and distributed API;  Scalable: data and computation;  Access data directly from HDFS;  Data graph is graph with data associated with every vertex/edge;  Update Functions are operations applied on a vertex and transforming data in the scope of the vertex;  Scheduler determines order of update Function evaluations; ◦ Static: Synchronous or round robin schedule; ◦ Dynamic: Update functions insert new tasks into the schedule.  Shared Data table: global constant parameters can be stored;  Sync operation, similar to MR’s “reduce”: accumulate and apply;  Ensures race-free operation;  Guarantees sequential consistency;  Dato: a startup based on GraphLab Create, a ML platform.
  104. 104. GraphLab in C++  Designed specifically for ML needs; ◦ Express data dependencies; ◦ Iterative;  Simplifies the design of parallel programs; ◦ Abstract away HW issues; ◦ Auto data synchronization; ◦ Addresses multiple HW architectures;  Multi cores now;  Distributed and GPU version in progress;  GraphLab Create™ is a Python library: ◦ Scalable ML modules such as boosted decision trees, deep learning and text analytics; ◦ Built-in integration with Spark, Hadoop, Apache Avro and ODBC; ◦ Integration with AWS EC2 and S3 for scalable compute and storage;  PowerGraph: A framework for large-scale ML and graph computation.
  105. 105. GraphLab Model [Diagram panels: Data Graph; Shared Data Table; Scheduling; Update Functions and Scopes]
  106. 106. BSP Model  Bulk Synchronous Parallel model (Leslie Valiant, 1990);  Advantages over MapReduce and MPI: ◦ Supports message passing paradigm style of application development ◦ Provides a flexible, simple, and easy-to-use small APIs ◦ Perform better than MPI for communication-intensive applications ◦ Guarantees impossibility of deadlocks or collisions in the communication mechanisms  A BSP computation proceeds in a series of global supersteps.  A superstep consists of three components: ◦ Concurrent computation. Each process can perform using local data values. ◦ Communication. The processes exchange data between themselves. ◦ Barrier synchronization. Waits until all other processes have finished communication actions.
  107. 107. BSP Model superstep
  108. 108. Pregel for Graph Computing  It is a master/slave model for large-scale graph processing;  BSP model-based;  Vertex-centric model: for each vertex ◦ An arbitrary “value” that can be get/set. ◦ List of messages sent to it ◦ List of outgoing edges (edges have a value too) ◦ A binary state (active/inactive)  Combiners: ◦ Sometimes vertices only care about a summary value for the messages; ◦ Combiners allow for this (examples: min, max, sum, avg) ◦ Messages combined locally and remotely  Aggregators: ◦ Compute aggregate statistics from vertex-reported values ◦ During a superstep, each worker aggregates values from its vertices ◦ At the end of a superstep, partially aggregated values are reduced in a tree structure  Fault tolerance: save state at checkpoints and recover if necessary  Confined recovery: ◦ The failed worker “catches up” to the rest, and other workers resend messages to it (under development?)
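The vertex-centric model with supersteps can be illustrated by propagating the maximum value through a graph. The sketch is single-process and hypothetical (`pregel_max` is not a Pregel API): a vertex with no incoming messages stays inactive, `max` plays the role of a combiner, and swapping the inbox for the outbox stands in for the barrier.

```python
def pregel_max(values, edges, max_supersteps=50):
    """Propagate the largest vertex value to every reachable vertex."""
    values = dict(values)
    # Superstep 0: every vertex is active and sends its value to its neighbors.
    inbox = {v: [] for v in values}
    for v, value in values.items():
        for dst in edges.get(v, []):
            inbox[dst].append(value)
    supersteps = 1
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if not messages:
                continue                  # inactive: no messages were received
            best = max(messages)          # combiner: only the maximum matters
            if best > values[v]:
                values[v] = best
                for dst in edges.get(v, []):
                    outbox[dst].append(best)
            # Otherwise the vertex votes to halt by sending nothing.
        inbox = outbox                    # barrier: messages arrive next superstep
        supersteps += 1
    return values, supersteps
```

The computation stops when no messages are in flight, which corresponds to every vertex having voted to halt.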
  109. 109. Apache Giraph  Developed at Yahoo! and used by Facebook (an open-source implementation of Pregel);  Runs on Hadoop as a MapReduce job;  Focuses on graph-based bulk synchronous parallel (BSP) computing;  ZooKeeper: responsible for computation state; ◦ partition/worker mapping ◦ checkpoint paths, aggregator values, statistics  Master: responsible for coordination ◦ assigns partitions to workers ◦ coordinates synchronization ◦ requests checkpoints ◦ aggregates aggregator values ◦ collects health statuses  Worker: responsible for vertices ◦ invokes the compute() function of active vertices ◦ sends, receives and assigns messages ◦ computes local aggregation values
  110. 110. Apache Hama  General-purpose BSP computing, not only for graph processing; ◦ Job management / monitoring ◦ Checkpoint recovery ◦ Pluggable message transfer architecture  Written in Java;  Local & (pseudo-)distributed run modes;  MapReduce-like I/O API;  Can run in the cloud using Apache Whirr;  Can run on Hadoop 2.0 (YARN);  Graph API;  Applications besides graph processing: machine learning  Collaborative filtering  Clustering: k-means  Gradient descent for training in classification
  111. 111. Apache Hama
  112. 112. Vowpal Wabbit @ Yahoo/MS  Scalable, fast, efficient linear ML engine written in C/C++  Hadoop compatible AllReduce (not MPI style) ◦ “Map” job moves program to data; ◦ Read (and cache) all data, before initializing AllReduce; ◦ Use map-only Hadoop for process control and error recovery. ◦ Use AllReduce code to sync state; ◦ Always save input examples in a cache file to speed later passes; ◦ Use hashing trick to reduce input complexity.  Algorithms in VW 7.0 ◦ Binary classification and regression ◦ Multiclass classification, NN, active learning  Reduction: One Against All or Cost-Sensitive OaA or Sequence Prediction ◦ Latent Dirichlet Allocation, and matrix factorization ◦ Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS); ◦ Stochastic Gradient Descent (SGD) and CG;  Online learning/Active learning: no need to load all data into memory ◦ Dimension correction  Feature caching (adaptive learning)  Feature hashing (importance update)
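The hashing trick mentioned above maps arbitrary string features into a fixed-size weight vector, so no feature dictionary has to be built or shared between nodes. The following is a minimal sketch, assuming CRC32 from the standard library as a stand-in hash; it is not VW's own hash function, and the names `hash_features` and `predict` are hypothetical.

```python
import zlib


def hash_features(tokens, num_bits=18):
    """Map string features to a sparse vector with 2**num_bits slots.

    CRC32 is an assumed stand-in hash chosen because it is deterministic
    and available in the standard library.
    """
    mask = (1 << num_bits) - 1
    vec = {}
    for tok in tokens:
        index = zlib.crc32(tok.encode("utf-8")) & mask
        # Colliding features simply share a slot; that is the accepted trade-off.
        vec[index] = vec.get(index, 0.0) + 1.0
    return vec


def predict(weights, vec):
    """Linear prediction over the hashed sparse vector."""
    return sum(weights[i] * x for i, x in vec.items())


num_bits = 10
weights = [0.0] * (1 << num_bits)
example = hash_features(["user=alice", "item=42", "user=alice"], num_bits)
score = predict(weights, example)
```

The memory footprint is fixed by `num_bits` regardless of how many distinct features appear in the stream, which is what makes the approach attractive for online learning.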
  113. 113. AllReduce in VW  Every node begins with a number (or vector) and ends up with the sum;  Ideal for summing local gradients and weights;  Creates a spanning tree over the nodes;  Each node receives a subset of the data;  At each iteration: ◦ Nodes compute the gradient over their local data; ◦ AllReduce computes the gradient over the entire data.
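A tree-based AllReduce can be sketched in a few lines: values are summed up the tree toward the root, then the total is broadcast back down. The sketch assumes a binary tree laid out by node index (parent of node i is (i - 1) // 2) and runs in one process; the function name `tree_allreduce` is hypothetical and this is not VW's implementation.

```python
def tree_allreduce(vectors):
    """Sum one vector per node over an index-based binary tree.

    Node 0 is the root and the parent of node i is (i - 1) // 2.
    Returns the vector each node holds after the operation.
    """
    n = len(vectors)
    held = [list(v) for v in vectors]
    # Reduce phase: higher-numbered nodes first, so children finish before parents.
    for node in range(n - 1, 0, -1):
        parent = (node - 1) // 2
        held[parent] = [a + b for a, b in zip(held[parent], held[node])]
    # Broadcast phase: every node copies the total from its parent.
    for node in range(1, n):
        parent = (node - 1) // 2
        held[node] = list(held[parent])
    return held


local_gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
summed = tree_allreduce(local_gradients)
```

Each node sends and receives only along tree edges, so the number of communication rounds grows with the tree depth rather than with the number of nodes.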
  114. 114. Trident-ML  A real-time online ML library using scalable online algorithms;  Built on top of Storm, running on a cluster of machines and supporting horizontal scaling;  Based on Trident, a high-level abstraction of Storm;  Trident-ML currently supports: ◦ Linear classification (Perceptron, Passive-Aggressive, Winnow, AROW, KLD for text) ◦ Linear regression (Perceptron, Passive-Aggressive) ◦ Clustering (KMeans) ◦ Feature scaling (standardization, normalization) ◦ Text feature extraction ◦ Stream statistics (mean, variance) ◦ Pre-trained Twitter sentiment classifier  Trident-ML is hosted on Clojars (a Maven repository).  Trident-ML processes unbounded streams of data implemented by an infinite collection of Instance or TextInstance.
  115. 115. Storm-Pattern  Based on Cascading’s sub-project, Pattern, which uses flows as containers for ML models; ◦ Cascading is a de facto Java application framework that eases large-scale development of rich data analytics and data management applications on Apache Hadoop and its API.  Imports PMML (Predictive Model Markup Language) model descriptions from R, SAS, Weka, RapidMiner, KNIME, SQL Server, etc. ◦ PMML is an XML-based file format developed by the Data Mining Group;  Working in tandem with the Lingual JDBC driver, modeling tools can pull data directly off Hadoop to train or test model quality; ◦ Lingual executes SQL queries as Cascading applications on Hadoop clusters.  Current support for PMML includes: ◦ Random Forest in PMML 4.0+ exported from R/Rattle; ◦ Linear Regression in PMML 1.1+; ◦ Hierarchical Clustering and K-Means Clustering in PMML 2.0+; ◦ Logistic Regression in PMML 4.0.1+.
  116. 116. Samoa: Big Data Mining  Scalable Advanced Massive Online Analysis; ◦ A platform for stream data mining on S4/Storm from Yahoo, to be open source soon;  Short-term algorithms: ◦ Hoeffding tree, K-means, Gradient boosted decision tree;  Long-term algorithms: ◦ Integrate with add-on packages (like R); ◦ Implement most common ML methods (like Mahout).  Algorithm implementation: horizontal/vertical data parallelism;  Platform design: ML Developer API;  Deployment and runtime.
  117. 117. Apache Samoa Runtime
  118. 118. H2O: Killer App on Spark  Scalable open source machine learning and deep learning for smarter applications;  Fast and memory-efficient Java implementations based on columnar compression and fine-grain Map/Reduce;  The H2O platform consists of many H2O nodes communicating in a P2P manner;  An H2O cluster can be started in two ways: ◦ A client starts with a flat file containing the addresses of its peers; ◦ A client requests a number of nodes from a cluster resource manager, e.g. YARN;  Multi-threaded and distributed parallel computation that runs on either a single node or a multi-node cluster;  Parallelism of SGD: HogWild! ◦ It follows a shared memory model where multiple cores, each handling separate subsets (or all) of the training data, are able to make independent contributions to the gradient updates asynchronously;  Distributed H2O Deep Learning is used for text classification at eBay and fraud detection at PayPal.
  119. 119. Deeplearning4j: DL4J for Java  The first commercial-grade, open-source, distributed deep-learning library written for Java and Scala, integrated with Hadoop and Spark; ◦ A versatile n-dimensional array class; ◦ GPU integration; ◦ Scalable on Hadoop, Spark and Akka + AWS and other platforms;  Both a distributed multi-threaded deep-learning framework and a normal single-threaded deep-learning framework;  Implemented models include: ◦ Restricted Boltzmann machines ◦ Convolutional Nets (images) ◦ Stacked Denoising Autoencoders ◦ Recurrent Nets/LSTMs (time series) ◦ Recursive autoencoders ◦ Deep-belief networks ◦ Deep Autoencoders (QA/information retrieval) ◦ Recursive Neural Tensor Networks (scenes, parsing)
  120. 120. M2C: Mobile Cloud System for DL  Three parts: cloud side, data transmission channel and mobile device; ◦ Spark + Android/iOS + Cuda;  Software stack: Amazon, 3G communication, HP LoadRunner, php-based.
  121. 121. SparkNet with Caffe at UC Berkeley  A framework for training deep networks in Spark.  Includes an interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multidimensional tensor library.  Using a simple parallelization scheme for SGD, SparkNet scales well with the cluster size and tolerates very high-latency communication.  Compatible with existing Caffe models.  Performance is benchmarked on the ImageNet dataset.
  122. 122. Caffe on Spark at Yahoo  CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.  Complementary to the non-deep-learning libraries MLlib and Spark SQL; its data-frame style API provides Spark applications with a mechanism to invoke deep learning over distributed datasets.  Its server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates the scalability bottleneck.  As a distributed extension of Caffe, CaffeOnSpark supports NN model training, testing, and feature extraction.  Caffe users can now perform distributed learning using their existing LMDB data files and a slightly adjusted network configuration.  Early benchmarks indicate an 18x speedup for deep networks.  Used by Yahoo for image search, content classification and several other use cases.  Open source under Apache 2.0 License:
  123. 123. Caffe on Spark at Yahoo
  124. 124. Data and Model Parallelism  Data Parallelism ◦ Dividing the data across machines is a common strategy for Big Data problems; ◦ The data D is partitioned and assigned to computational workers; ◦ Additive updates are the foundation for a host of techniques to speed up data-parallel execution, such as mini-batch, asynchronous and bounded-asynchronous execution, and parameter servers; ◦ Key to the validity of adding updates from different workers is the notion of independent and identically distributed (iid) data, which is assumed for many ML programs.  Model Parallelism: ◦ Dividing the ML model across machines is common for Big Models; ◦ The model is partitioned and assigned to workers, and updated therein in parallel; ◦ Each update function also takes a scheduling function, which restricts the update to a subset of the model parameters; ◦ Unlike iid data, the model parameters are in general not independent of each other, so model-parallel algorithms can only be effective if the parallel updates are restricted to independent or weakly correlated parameters; ◦ A global scheduling mechanism can select carefully chosen parameters for parallel updating; ◦ The scheduling function opens up a large design space, such as fixed, randomized, or even dynamically changing scheduling on the whole space, or a subset, of the model parameters.
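The additivity that data parallelism relies on can be checked on a toy problem: for a sum-of-examples loss, the gradient over the full data equals the sum of the gradients computed on each partition. The sketch below assumes a one-parameter least-squares model y ≈ w·x; the function names and the data are illustrative only.

```python
def gradient(w, data):
    """Summed squared-error gradient for the model y ≈ w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in data)


def data_parallel_step(w, partitions, lr):
    """One synchronous step: each worker's partial gradient is added up."""
    partial_gradients = [gradient(w, part) for part in partitions]  # one per worker
    return w - lr * sum(partial_gradients)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]   # generated by y = 2x
partitions = [data[:2], data[2:]]                          # two workers

w = 0.0
for _ in range(50):
    w = data_parallel_step(w, partitions, lr=0.01)
```

Because the partial gradients add up exactly to the full gradient, the partitioned run follows the same trajectory as single-machine gradient descent; mini-batch and asynchronous schemes relax this exactness in exchange for less waiting.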
  125. 125. Data and Model Parallelism  Data- and model-parallel programs are stateful, in that they continually update shared model parameters;  An ML platform needs to synchronize across all running threads and processes, in a high-performance, non-blocking manner that still guarantees convergence;  If the program is model-parallel, it may require fine control over parameter scheduling to avoid non-convergence.
  126. 126. Data and Model Parallelism in CNN  Data parallelism is good for convolutional layers; ◦ Data parallelism in SGD = bigger mini-batches;  Double buffering: break each mini-batch in half, then exchange sub-gradients of one half while computing those of the next half (delayed); ◦ Decreased frequency of updates causes lower convergence speed! ◦ Overhead: communication cost of parameter transfers (part of each update); ◦ Computation per weight is dominant.  Model parallelism is good for fully connected layers; ◦ Vertical Partitioning (Striping) ◦ Horizontal Partitioning (Pipelining) ◦ Convolution layers can also run model parallelism by splitting the filters! ◦ Overhead: communication (feature and even gradient transfers) and synchronization cost;  Project Adam only sends activations & error gradients rather than weight updates; ◦ Computation per neuron is dominant.  Optimal partitioning minimizes the overhead of model and data parallelism;  Data and model parallelism can be combined in a hybrid scheme!
  127. 127. Peer-to-Peer Framework  Hadoop compatible AllReduce (implemented in MPI package); ◦ Every node starts with a number and ends up with the sum of the numbers across all the nodes; ◦ A tree structure is imposed on the communicating nodes and the operation proceeds in two phases: numbers are first summed up the tree (reduce) and then broadcast down to all the nodes (broadcast); ◦ Hybrid online + batch method to handle the reliability of a single spanning tree-based AllReduce;  Online first on each node and then batch with AllReduce.  Butterfly mixing: interleave communication with computation; ◦ Build a butterfly network on 2^k nodes and execute one constant-time communication step (a butterfly shuffle) for each computation step; ◦ In a butterfly network, every node computes some function of its own value and its partner's value and outputs the result to itself and the partner, so that after all stages every output node holds the average of the inputs; ◦ Butterfly mixing interleaves the averaging operation with iterative updates (for instance, from SGD); ◦ Model updates are fully distributed after k steps, but data from smaller subsets of nodes travels with lower latency, giving a convergence rate close to that of AllReduce; ◦ It requires higher bandwidth, but this is acceptable with hybrid butterfly + multi-node communication.
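The butterfly pattern can be sketched with scalar models: at stage s, node i averages with the partner whose index differs in bit s, and after k stages on 2^k nodes every node holds the global average. Butterfly mixing runs one such stage after each local update instead of all k. This is an illustrative single-process sketch; the function names and the `local_update` callback are assumptions, not an API from the butterfly mixing work.

```python
def butterfly_stage(values, stride):
    """One butterfly shuffle: node i averages with partner i XOR stride."""
    return [(values[i] + values[i ^ stride]) / 2.0 for i in range(len(values))]


def butterfly_allreduce_mean(values):
    """Run all k stages back to back; every node ends with the global mean."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "requires 2^k nodes"
    stride = 1
    while stride < n:
        values = butterfly_stage(values, stride)
        stride *= 2
    return values


def butterfly_mixing(models, local_update, num_steps):
    """Interleave one local update with a single butterfly stage per step."""
    n = len(models)
    assert n >= 2 and n & (n - 1) == 0, "requires 2^k nodes, k >= 1"
    k = n.bit_length() - 1
    for t in range(num_steps):
        models = [local_update(i, m) for i, m in enumerate(models)]
        models = butterfly_stage(models, 1 << (t % k))
    return models


averaged = butterfly_allreduce_mean([1.0, 2.0, 3.0, 4.0])
mixed = butterfly_mixing([1.0, 2.0, 3.0, 4.0], lambda i, m: m, num_steps=2)
```

With an identity `local_update`, k mixing steps reproduce the full average, which shows why information from every node reaches every other node after k steps.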
  128. 128. Peer-to-Peer Framework  Different parallelization schemes (N = 4). (a) and (b) synchronize model before every gradient step, with tree-based and butterfly AllReduce respectively. They suffer from overwhelmingly high communication overhead. (c) reduces synchronization overhead without losing convergence performance by mixing model update with communication. (a) Tree AllReduce (b) Butterfly AllReduce (c) Butterfly mixing
  129. 129. FireCaffe: Near-linear Acceleration of DNN Training on Clusters  Reduce communication overhead wherever possible, while not degrading the accuracy of the DNN models: ◦ Select network HW with high bandwidth between GPU servers: Infiniband or Cray interconnects; ◦ Reduction trees are more efficient/scalable than the traditional parameter server; ◦ Increase the batch size to reduce the total quantity of communication;  Identify hyperparameters that reproduce the small-batch accuracy while training with large batch sizes.  GoogLeNet/NIN on ImageNet: 16x / 23x speedup respectively, on a cluster of 32 GPUs.
  130. 130. Parameter Server Framework  Provide an efficient mechanism for aggregating and synchronizing model parameters and statistics between workers; ◦ Each parameter server node maintains only a part of the parameters, and each worker node typically requires only a subset of the parameters when operating.  1st generation: YahooLDA, lacks flexibility and performance;  2nd generation: DistBelief, Petuum, with constraints on the worker threading model;  3rd generation: large scale problem solving with the general composite objective function;  General purpose distributed learning systems: Mahout, MLI;  Asynchronous distributed optimization: ◦ Shotgun (CD), Hogwild (SGD), NIPS (PS);  Insights: ◦ Many learning algorithms represent parameters as structured objects, such as vectors, matrices, or tensors, and only a part of the object is updated at each time; ◦ Live replication of parameters between servers supports hot failover and dynamic scaling as machines are removed or added.
  131. 131. Parameter Server Framework  Parameter server nodes are grouped into a server group and several worker groups; A server node in the server group maintains a partition of the globally shared parameters; Server nodes communicate with each other to replicate and/or to migrate parameters for reliability and scaling.  There is a scheduler node for each worker group; It assigns tasks to workers and monitors their progress. • If workers are added or removed, it reschedules unfinished tasks; • The shared model parameters are sorted (key, value) pairs.
  132. 132. Parameter Server Framework • Goals of workflow scheduler: correctness, safety and speed-up. • Petuum: STRADS (STRucture-Aware-Dynamic Scheduler) • Schedule - Push - Pull for the SSP model; • CMU Parameter Server: flexible consistency model and asynchronous task dependency graph • Task scheduler for each worker group and server manager for user defined functions; • SINGA: optimizer for data/model partition and distributed in-memory parameter table • Basic APIs as get-collect-put-update to realize synchronous/asynchronous SGD in training CNN; • Google’s DistBelief: data and model parallelism for distributed asynchronous optimization • Load balancing in data partition and operations “Fetch – Push” for delayed update with coordinator; • MSR’s Project Adam: data and model parallelism with different protocols for Conv/FC layers • A data flow framework to mitigate slow machines’ impact and a PS controller for bucket management.
  133. 133. YahooLDA (Latent Dirichlet Allocation)  Three principal challenges: ◦ Synchronizing the global state; ◦ Storing and retrieving the large local state efficiently; ◦ Sequentially incorporating streaming data.  Scalable parallel framework for latent variable models ◦ A novel delta-based aggregation system for asynchronously updating global variables, with a bandwidth-efficient communication protocol; ◦ Schedule-aware, disk-based cache for out-of-core storage; ◦ An efficient online algorithm for latent variable inference in streaming data; ◦ Approximate forward sampling to rapidly incorporate new data; ◦ Fault tolerance and fast recovery for large distributed global state.  Note: LDA is a Bayesian extension of pLSA (probabilistic Latent Semantic Analysis) that adds Dirichlet priors
  134. 134. DistBelief: Distributed Deep Networks  Utilize clusters with thousands of machines to train large models;  The performance benefits of distributing a deep network across multiple machines depend on the connectivity structure and computational needs of the model;  Two new optimization methods: ◦ Downpour SGD, an asynchronous stochastic gradient descent (A-SGD) procedure supporting a large number of model replicas (online), with Adagrad;  More robust to machine failures than standard (synchronous) SGD; ◦ Sandblaster, which supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS (limited memory) with both data and model parallelism;  Coordination: issues commands drawn from a small set of operations performed by each parameter server shard independently, with the results being stored locally on the same shard;  Load balancing scheme: assigns each of the N model replicas a small portion of work, much smaller than 1/Nth of the total size of a batch, and assigns replicas new portions whenever they are free.  Trains a deep network 30x larger than before, and achieves state-of-the-art performance on ImageNet.
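The combination of asynchronous pushes and Adagrad on a parameter server shard can be sketched as follows. This is a minimal single-process illustration under stated assumptions: the class name `AdagradShard`, the dictionary-based parameters, and the toy loss (w - 3)^2 are hypothetical and do not reflect DistBelief's actual interfaces.

```python
import math


class AdagradShard:
    """One parameter-server shard applying Adagrad to pushed gradients."""

    def __init__(self, params, lr=0.1, eps=1e-8):
        self.params = dict(params)
        self.accum = {key: 0.0 for key in params}   # running sum of squared gradients
        self.lr = lr
        self.eps = eps

    def pull(self):
        """Replicas fetch a (possibly soon stale) copy of the parameters."""
        return dict(self.params)

    def push(self, grads):
        """Apply gradients in arrival order with a per-parameter step size."""
        for key, g in grads.items():
            self.accum[key] += g * g
            self.params[key] -= self.lr * g / (math.sqrt(self.accum[key]) + self.eps)


def replica_gradient(params):
    """Gradient of the toy loss (w - 3)^2 at the replica's local copy."""
    return {"w": 2.0 * (params["w"] - 3.0)}


shard = AdagradShard({"w": 0.0}, lr=0.5)
stale_copy = shard.pull()                      # replica B fetches early
for _ in range(200):
    fresh_copy = shard.pull()                  # replica A fetches just in time
    shard.push(replica_gradient(fresh_copy))
    shard.push(replica_gradient(stale_copy))   # B pushes a gradient of old parameters
    stale_copy = fresh_copy                    # B refreshes one step behind
```

The shard never waits for replicas to agree on a parameter version; it simply applies whatever gradient arrives, and the per-parameter Adagrad scaling dampens the effect of large or stale updates.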
  135. 135. DistBelief: Distributed Deep Nets at Google [Figure panels: model parallelism; model parallelism + data parallelism]
  136. 136. Model parallelism in DistBelief Downpour SGD Sandblaster L-BFGS
  137. 137. Project Adam  Building a Scalable Distributed Deep Learning Training System; ◦ Achieves high efficiency and scalability by optimizing and balancing the workload;  Data serving machines: data augmentation to avoid overfitting;  Exploit asynchrony to improve the performance & accuracy of trained models; ◦ Multiple replicas asynchronously update a shared model via a global parameter server; ◦ Partition the models vertically across the model worker machines; ◦ Multi-threaded model parameter update locally without locks via a training context; ◦ Asynchronous batched parameter updates: weight updates are associative and commutative, which permits computation over stale parameters; ◦ Passing pointers in communication of neurons, to reduce the memory copies; ◦ Cache locality: two assembly kernels to pack and block the data for matrix multiply; ◦ Slow machine impact: process multiple images in parallel, and end a training epoch before every image is completely processed, with no impact on the model accuracy; ◦ Send activations & error gradients rather than weight updates to the parameter server machine.  Global parameter server: communication with model training machines ◦ Model parameters are divided into 1MB shards, hashed into storage buckets that are distributed equally; ◦ Use SSE/AVX instructions, lock-free data structures for queues and hash tables, lock-free memory allocation, three copies of each parameter for fault tolerance, and decoupled durability.
  138. 138. Parameter Server Node Architecture. Model partitioning across training machines. Based on the Multi-Spert system, the cluster currently comprises 120 identical machines organized as three equally sized racks connected by IBM G8264 switches. Each machine is an HP ProLiant server with dual Intel Xeon E5-2450L processors for a total of 16 cores running at 1.8 GHz, with 98 GB of main memory, two 10 Gb NICs and one 1 Gb NIC.
  139. 139. Petuum: Iterative-Convergent Distributed ML  Petuum comes from “perpetuum mobile”, a musical style (continuous steady stream of notes);  General-purpose distributed platform for Big Machine Learning;  Executes iterative updates in a manner that most quickly minimizes the loss function in a large-scale distributed environment;  Statistical: error-bounded consistency schemes to decrease network communication, rescheduling of updates to decrease correlation effects, and optimized load balancing;  A parameter server (Petuum-PS) for global parameter synchronization; ◦ A distributed key-value store that provides client machines shared-memory access to global parameters sharded on the server machines; ◦ A table-based user interface: global tables accessed globally by a row-column ID pair; ◦ Bounded Consistency: Stale Synchronous Parallel (SSP) Consistency and Value-Bounded Consistency; ◦ Process-Level and Thread-Level Caching: caching frequently accessed table rows, e.g. Least-Recently Used (LRU), Two-List LRU, and Priority LRU; ◦ Out-of-Core Storage Support: efficiently streaming data from hard disks or SSDs.
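The idea behind Stale Synchronous Parallel consistency is that a worker may run ahead of the slowest worker by at most a fixed number of clock ticks. A minimal sketch of that rule follows; the class name `SSPClock` and its methods are hypothetical and are not the Petuum-PS API.

```python
class SSPClock:
    """Tracks per-worker iteration clocks under a staleness bound."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness

    def can_proceed(self, worker):
        # A worker may start its next iteration only if it is at most
        # `staleness` ticks ahead of the slowest worker.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def tick(self, worker):
        """Finish one iteration; a real system would block instead of raising."""
        if not self.can_proceed(worker):
            raise RuntimeError("worker %d must wait for stragglers" % worker)
        self.clocks[worker] += 1


ssp = SSPClock(num_workers=2, staleness=1)
ssp.tick(0)
ssp.tick(0)                              # worker 0 is now two ticks ahead
fast_blocked = not ssp.can_proceed(0)    # it has to wait for worker 1
ssp.tick(1)                              # the straggler catches up by one tick
fast_released = ssp.can_proceed(0)
```

Setting `staleness=0` recovers BSP-style lockstep execution, while a very large bound behaves like fully asynchronous execution; SSP sits between the two.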
  140. 140. Petuum: Iterative-Convergent Distributed ML  A structure-aware dynamic scheduler (STRADS) organizes and distributes worker tasks; ◦ Dynamic Scheduling and Adaptive Load Balancing.  The ML library includes (and will be steadily enriched): ◦ Convolutional Neural Network (CNN) ◦ Distance Metric Learning ◦ Multiclass Logistic Regression ◦ Nonnegative Matrix Factorization ◦ Sparse Coding ◦ K-means ◦ MedLDA advanced topic model ◦ LDA topic model ◦ Matrix Factorization (collaborative filtering) ◦ Fully-connected Deep Neural Networks  Run bigger models on less hardware, very economical!
  141. 141. Petuum: Iterative-Convergent Distributed ML Petuum system: scheduler, workers, parameter servers.
  142. 142. Petuum: Iterative-Convergent Distributed ML Petuum parameter server topology. Servers and clients interact via a bipartite topology. A name-node machine handles bookkeeping and assignment of keys to servers. STRADS architecture. The worker machines can be Petuum-PS clients or nodes without PS support. SSPTable Programming
  143. 143. The New Generation Parameter Server  The model shared among nodes can be represented as a set of (key, value) pairs; ◦ Partitions keys much as a conventional distributed hash table does: consistent hashing;  Data is sent between nodes using push and pull operations (with range-based support);  The asynchronous communication model is optimized for machine learning tasks to reduce network traffic and overhead;  Relaxed consistency further hides synchronization cost and latency to balance algorithmic convergence rate and system efficiency, depending on data, algorithm, and hardware; ◦ Asynchronous task dependency graphs: the caller continues immediately after issuing a task; ◦ User-defined filters: selective synchronization of (key, value) pairs within a task;  New nodes can be added without restarting the running framework;  Vector clocks (range-based), recording the time of each node on each (key, value) pair, ensure well-defined behavior after network partition and failure;  The globally shared parameters are represented as (potentially sparse) vectors and matrices to facilitate development of machine learning applications
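The (key, value) view of the model and the push/pull operations can be sketched with a small in-process store. As a simplification, this sketch assigns contiguous key ranges to servers instead of the consistent hashing named above; the class `ShardedParameterServer` and its methods are hypothetical, not the CMU parameter server API.

```python
import bisect


class ShardedParameterServer:
    """Sparse (key, value) parameters split across server shards by key range."""

    def __init__(self, num_keys, num_servers):
        # Server s owns integer keys in [bounds[s], bounds[s + 1]).
        # Assumed simplification: range partitioning, not consistent hashing.
        self.bounds = [s * num_keys // num_servers for s in range(num_servers + 1)]
        self.shards = [dict() for _ in range(num_servers)]

    def server_for(self, key):
        return bisect.bisect_right(self.bounds, key) - 1

    def push(self, updates):
        """Workers send additive updates; only the owning shard is touched."""
        for key, delta in updates.items():
            shard = self.shards[self.server_for(key)]
            shard[key] = shard.get(key, 0.0) + delta

    def pull(self, lo, hi):
        """Range-based pull: fetch all stored pairs with lo <= key < hi."""
        result = {}
        for shard in self.shards:
            for key, value in shard.items():
                if lo <= key < hi:
                    result[key] = value
        return result


ps = ShardedParameterServer(num_keys=10, num_servers=3)
ps.push({0: 1.0, 3: 2.0, 9: 0.5})
ps.push({3: 1.0})
front = ps.pull(0, 5)
```

A worker that only touches a few features pulls just the matching key range, which is how the design keeps traffic proportional to what each worker actually needs.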
  144. 144. SINGA: A Distributed Deep Learning Platform  SINGA is a distributed deep learning platform for training large-scale deep learning models;  SINGA allows users to write training algorithms by exposing intuitive programming abstractions and hiding complex details pertaining to distributed execution of the training;  Specifically, the programming model consists of data objects (layer and network), and of computation functions over the data objects;  SINGA uses an optimization algorithm that generates the parallelism scheme with minimal overhead; ◦ Model Partition: one model replica spreads across multiple machines to handle large models, at the cost of synchronizing layer data across machines within one model replica; ◦ Data Partition: one model replica trains against a partition of the whole training dataset, at the cost of synchronizing parameters among model replicas; ◦ Hybrid Partition: exploits a cost model to find optimal model and data partitions that reduce both overheads.
  145. 145. SINGA: A Distributed Deep Learning Platform • Implements the Deep Big Multilayer Perceptron, trained on the MNIST dataset, with about 10 million parameters; • Implements a Deep Convolutional Neural Network for classifying the ImageNet dataset from ILSVRC2012.
  146. 146. SINGA: A Distributed Deep Learning Platform SINGA architecture, whose logical components include, table servers maintaining a distributed parameter table, worker groups running the optimization algorithm in parallel against a data shard containing a partition of the training data, and a master who is in charge of partitioning the model and training data as well as of resource coordination.
  147. 147. SINGA: A Distributed Deep Learning Platform The logical system architecture. Workers and servers are grouped according to the cluster configuration. Each worker group runs against a partition of the training dataset to compute the updates (e.g., the gradients) of ParamShard. Worker groups run asynchronously, while workers within one group run synchronously. Each server group also maintains ParamShard. It receives and handles requests (e.g., Get/Put/Update) from workers. Every server group synchronizes with neighboring server groups.
  148. 148. Parameter Server Framework • MSR’s Adam: 90 machines for workers, 20 for PS, 10 for data servers; • Google’s DistBelief: 144 machines (~2300 cores), up to 1000 machines; • Petuum: Three experiment settings • 128 nodes and 2 AMD cores per node with 1GB Ethernet; • 4 nodes with 64 AMD cores and 40GB Infiniband; • 4 nodes and 16 Intel cores per node with 10GB Ethernet; • CMU PS: Three experiment settings • 15 machines for 90 virtual nodes (each node has 64 cores with connection of 40GB Ethernet); • 1000 machines, each with 16 cores, connected with 10GB Ethernet (800 for workers, 200 for PS); • 10 physical cores with 10 GB network connectivity, one with 800 + 200 (workers and PS) and another with 5000 + 2000. • SINGA: An in-house cluster • 72 nodes on 2 racks, 1Gbps switch between intra-rack nodes, 4 cores per node; • Tencent’s Mariana: CPU+GPU • 2 Intel 8-core CPUs and 4-6 Nvidia Tesla GPUs (6 for ASR and 4 for ImageNet);
  149. 149. Parameter Server Framework
  150. 150. MXNET: Distributed Deep Learning  Programming interfaces in a consistent way; ◦ Tensor algebra interface (NDArray): compatible, GPU/CPU/ARM; ◦ Symbolic expression (Symbol): easy deep network design; ◦ Mixed programming (KVStore): CPU/GPU, symbolic + local;  Multiple languages: C++/Python/R/Julia/…;  All operations issued into backend C++ engine ◦ Engine builds read/write dependency graph; ◦ Lazy evaluation, parallel execution and sophisticated memory allocation;  Fast, memory efficient and distributed learning; ◦ Data Parallelization; ◦ Synchronous vs Asynchronous; ◦ Work on AWS: S3 data host and EC2 (GPU unit);
  151. 151. MXNET: Heterogeneous and Distributed Computation graph for both forward and backward Communication MXNET Overview
  152. 152. TensorFlow: Large-Scale Heterogeneous Distributed ML  An interface for expressing machine learning algorithms, and an implementation for executing such algorithms; ◦ A computation is described by a directed data flow graph, which is composed of a set of nodes; ◦ An operation (node) has a name and represents an abstract computation; ◦ A kernel is a particular implementation of an operation that runs on a particular type of device; ◦ Clients interact with the system by creating a Session;  The Session interface supports an Extend method to augment the graph with additional nodes and edges; ◦ A Variable returns a handle to a persistent mutable tensor that survives across executions of a graph.  Executes on heterogeneous systems, from mobile devices to large-scale distributed systems; ◦ The client uses the Session interface to communicate with the master, and worker processes are responsible for arbitrating access to one or more computational devices and for executing graph nodes as instructed by the master; ◦ Devices are the computational heart, and a tensor is the implementation of the multi-dimensional array;  Training and inference algorithms for deep neural network models;  The API and a reference implementation were released as open source
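The concepts on this slide (named operation nodes, kernels, a session that can be extended, and persistent variable state) can be illustrated with a toy dataflow evaluator. This sketch deliberately does not use or imitate the real TensorFlow API; the classes `Node` and `Session`, the `extend` and `run` methods, and the `state` dictionary are all hypothetical stand-ins.

```python
class Node:
    """A named operation; `kernel` is the concrete function that computes it."""

    def __init__(self, name, kernel, inputs=()):
        self.name = name
        self.kernel = kernel
        self.inputs = list(inputs)


class Session:
    """Holds the graph and evaluates only the nodes a fetch depends on."""

    def __init__(self):
        self.nodes = {}

    def extend(self, node):
        """Augment the graph with an additional node."""
        self.nodes[node.name] = node
        return node

    def run(self, fetch, feeds=None):
        cache = dict(feeds or {})          # fed values act as already computed

        def evaluate(name):
            if name not in cache:
                node = self.nodes[name]
                args = [evaluate(dep) for dep in node.inputs]
                cache[name] = node.kernel(*args)
            return cache[name]

        return evaluate(fetch)


state = {"w": 2.0}                          # variable state survives across runs


def assign_w(value):
    state["w"] = value
    return value


sess = Session()
sess.extend(Node("w", lambda: state["w"]))
sess.extend(Node("y", lambda w, x: w * x, inputs=["w", "x"]))
sess.extend(Node("update_w", assign_w, inputs=["y"]))

first = sess.run("y", {"x": 3.0})
sess.run("update_w", {"x": 3.0})
second = sess.run("y", {"x": 1.0})
```

The second run sees the value written by `update_w`, which mirrors the point that variables persist across executions of the graph while ordinary intermediate values do not.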
  153. 153. TensorFlow: Large-Scale Heterogeneous Distributed ML Single machine and distributed system structure Synchronous and asynchronous data parallel training Model parallel training
  154. 154. TensorFlow: Large-Scale Heterogeneous Distributed ML In TensorFlow, all computation—including parameter management—is represented in the dataflow graph, and the system maps the graph onto heterogeneous devices (like multi-core CPUs, general-purpose GPUs, and mobile processors) in the available processes. 1. Place an individual model replica on each GPU. 2. Update model parameters synchronously by waiting for all GPUs to finish processing a batch of data. Note: store and update all model parameters on the CPU.
  155. 155. DMTK: Distributed Machine Learning Toolkit @ MSR  DMTK Framework: A parameter server with a client SDK that supports a unified interface for data parallelization, a hybrid data structure for big model storage, model scheduling for big model training, and automatic pipelining for high training efficiency; ◦ Hybrid data structure for model storage, which uses separate data structures for high/low-frequency parameters, and thus achieves outstanding balance between memory capacity and access speed. ◦ Acceptance and aggregation of updates from local workers, which supports several different mechanisms for model sync-up, including BSP, ASP, SSP, in a unified manner. ◦ Local model cache, which syncs up with the parameter server only when necessary, thus greatly reducing the communication cost. ◦ Pipeline between local training and model communication, which enables very high training speed regardless of various conditions of computational resources and network connections. ◦ Scheduling big model training in a round-robin fashion, which allows each worker machine to pull the sub-models as needed from the parameter server, resulting in a frugal use of limited memory capacity and network bandwidth to support very big models.  LightLDA, an extremely fast and scalable topic model algorithm, with an O(1) Gibbs sampler and an efficient distributed implementation;  Distributed (Multisense) Word Embedding, a distributed version of the (multi-sense) word embedding algorithm.
  156. 156. DMTK: Distributed Machine Learning Toolkit @ MSR
  157. 157. CNTK at MSR  Computational Network Toolkit (CNTK): a unified deep-learning toolkit that describes NNs as a series of computational steps via a directed graph;  Azure GPU Lab (Project Philly), script beta, C++ code and Python, C#;  GPU cluster: cuDNN-v4 supported;  Modularized separation: computation network, execution engine, learning algorithm, model description, data reader;  Leaf nodes represent input values or network parameters, while other nodes represent matrix operations upon their inputs;  Operations: train, edit, evaluate, write etc.  Computation order is determined by a depth-first traversal of a DAG;  Popular models: feed-forward DNNs, CNNs, RNNs/LSTMs, DSSM etc.  Implements SGD for error BP learning with automatic differentiation and parallelization across multiple GPUs and servers.  Data parallelism: 1-bit quantized SGD to reduce data communication cost;  Memory sharing: same memory across mini-batches, across computation nodes;  An open-source license since April 2015:
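The idea of 1-bit quantized SGD is to transmit only the sign of each gradient element plus a pair of reconstruction values, and to carry the quantization error over into the next mini-batch so that it is not lost. The sketch below is a simplified illustration under stated assumptions: it quantizes a flat list rather than matrix columns, uses the mean of each sign group as the reconstruction value, and its function names are hypothetical rather than CNTK's.

```python
def one_bit_quantize(grad, residual):
    """Quantize a gradient to one bit per element with error feedback.

    Returns (bits, (pos_value, neg_value), new_residual).
    """
    corrected = [g + r for g, r in zip(grad, residual)]   # add last step's error
    positives = [c for c in corrected if c >= 0]
    negatives = [c for c in corrected if c < 0]
    pos_value = sum(positives) / len(positives) if positives else 0.0
    neg_value = sum(negatives) / len(negatives) if negatives else 0.0
    bits = [c >= 0 for c in corrected]
    decoded = one_bit_decode(bits, (pos_value, neg_value))
    # Whatever the receiver cannot reconstruct is kept for the next round.
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return bits, (pos_value, neg_value), new_residual


def one_bit_decode(bits, values):
    """Receiver side: rebuild an approximate gradient from bits and two floats."""
    pos_value, neg_value = values
    return [pos_value if b else neg_value for b in bits]


grad = [0.5, -1.0, 1.5, -3.0]
bits, values, residual = one_bit_quantize(grad, [0.0] * 4)
approx = one_bit_decode(bits, values)
```

Only `bits` and the two floats in `values` need to cross the network, so the per-element cost drops from a 32-bit float to a single bit, while the residual keeps the accumulated update unbiased over time.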
  158. 158. CNTK at MSR