Big Data Architecture - Technology and Platforms

  1. 1. (Big-)Data Architecture (Re-)Invented Part-2 William El Kaim Dec. 2016 – V3.3
  2. 2. This Presentation is part of the Enterprise Architecture Digital Codex http://www.eacodex.com/ Copyright © William El Kaim 2016 2
  3. 3. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 3
  4. 4. Storage vs. Processing [diagram] Maps processing options (MapReduce / Tez, Spark, Hive, Pig, Cascading, Scalding, Streaming, Mahout, Giraph, Hama, R, Python, SciKit, Impala, Hawq, Stinger, HBase) onto storage options: distributed file systems (HDFS, GlusterFS, Ceph, Isilon, OpenStack Swift, S3, MapR) and NoSQL datastores (Cassandra, DynamoDB, Ring), plus the local FS, across OLAP, OLTP, and machine-learning workloads. Source: Octo Technology. Copyright © William El Kaim 2016 4
  5. 5. Big Data Technologies Copyright © William El Kaim 2016 5
  6. 6. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Ingestion Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 6
  7. 7. Understanding Streaming Semantics Copyright © William El Kaim 2016 7
  8. 8. Ingestion Technologies Apache Flume • Apache Flume is a distributed and reliable service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS (especially “logs”). • Data is pushed to the destination (Push Mode). • Flume does not replicate events: if a Flume agent fails, the events held in its channel are lost. Copyright © William El Kaim 2016 8
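To make the push model concrete, here is a minimal sketch of a Flume agent definition (not from the original deck; the agent name, file paths, and NameNode address are all illustrative). An exec source tails a log file into a memory channel, and an HDFS sink drains the channel:

```properties
# Illustrative Flume agent: tail a log file and push events into HDFS
agent.sources = tail-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

agent.sources.tail-src.type = exec
agent.sources.tail-src.command = tail -F /var/log/app/app.log
agent.sources.tail-src.channels = mem-ch

# Events buffered here are lost if the agent dies: Flume does not replicate the channel
agent.channels.mem-ch.type = memory
agent.channels.mem-ch.capacity = 10000

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/logs/%Y-%m-%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfs-sink.channel = mem-ch
```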
  9. 9. Ingestion Technologies Apache Kafka • Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system, developed by LinkedIn, that persists messages to disk (Pull Mode) • Designed for high throughput, Kafka is often used in place of traditional message brokers like JMS and AMQP because of its higher throughput, reliability, and replication. • Uses topics to which many listeners can subscribe, so message processing can happen in parallel on various channels • High availability of events (recoverable in case of failures) Copyright © William El Kaim 2016 9
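A sketch of the publish side using the standard Kafka Java client (the broker address, topic, key, and value are made up); any number of consumer groups can then pull from the topic independently, at their own pace:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Append one event to the "clickstream" topic; Kafka persists it to disk
            // and replicates it, so consumers can pull it (and re-pull it) later
            producer.send(new ProducerRecord<>("clickstream", "user42", "page=/home"));
        }
    }
}
```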
  10. 10. Ingestion Technologies Apache Storm • Apache Storm, developed by BackType (bought by Twitter), is a reliable system for processing streaming data in real time (and generating new streams). • Designed to support wiring “spouts” (think input streams) and “bolts” (processing and output modules) together as a directed acyclic graph (DAG) called a topology. • One strength is the catalogue of available spouts specialized for receiving data from all types of sources. • Storm topologies run on clusters, and the Storm scheduler distributes work to nodes around the cluster based on the topology configuration. Copyright © William El Kaim 2016 10
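A minimal local-mode sketch of such a topology (assuming Storm 1.x packages; TestWordSpout ships with Storm and stands in for a real source such as the Twitter spout shown on the following example slides):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountTopology {

    // Bolt that keeps a running count per word arriving from the spout
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            int count = counts.merge(word, 1, Integer::sum);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout and bolt are wired into a DAG; fieldsGrouping routes each word
        // to the same bolt instance so its count stays consistent
        builder.setSpout("words", new TestWordSpout(), 2);
        builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("words", new Fields("word"));

        new LocalCluster().submitTopology("word-count", new Config(), builder.createTopology());
    }
}
```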
  11. 11. Ingestion Technologies Storm: Example • Twitter streams, counting words, and storing them in a NoSQL database Source: Trivadis. Copyright © William El Kaim 2016 11
  12. 12. Ingestion Technologies Storm: Example Source: Trivadis. Copyright © William El Kaim 2016 12
  13. 13. Ingestion Technologies Twitter Heron • Twitter dropped Apache Storm in production in 2015 and replaced it with a homegrown data processing system named Heron. • Apache Storm was the original solution to Twitter's problems. • Storm was reputedly hard to work with and hard to get good results from, and despite a recent 1.0 renovation, it has been challenged by other projects, including Apache Spark and its own revised streaming framework. • Heron was built from scratch with a container- and cluster-based design, outlined in a research paper. • The user creates Heron jobs, or "topologies," and submits them to a scheduling system, which launches the topology in a series of containers. • The scheduler can be any of a number of popular schedulers, like Apache Mesos or Apache Aurora. Storm, by contrast, has to be manually provisioned on clusters to add scale. • In May 2016 Twitter released Heron under an open source license. Source: Infoworld. Copyright © William El Kaim 2016 13
  14. 14. Ingestion Technologies Twitter Heron • Heron is backward-compatible with Storm's API. • Storm spouts and bolts can be reused in Heron • Gives existing Storm users some incentive to check out Heron. • Heron • Code is written in Java (or Scala). • The web-based UI components are written in Python. • The critical parts of the framework, the code that manages the topologies and network communications, are written in C++. • Twitter claims it has been able to gain anywhere from two to five times an improvement in "efficiency" (basically, lower opex and capex) with Heron. Source: Infoworld. Copyright © William El Kaim 2016 14
  15. 15. Ingestion Technologies Apache Spark • Spark supports real-time distributed computation and stream-oriented processing, but it's more of a general-purpose distributed computing platform. • In-memory data storage for very fast iterative processing • Replacement for the MapReduce functions of Hadoop, running on top of an existing Hadoop cluster and relying on YARN for resource scheduling. • Spark can layer on top of Mesos for scheduling or run as a stand-alone cluster using its built-in scheduler. • Where Spark shines is in its support for multiple processing paradigms and the supporting libraries Copyright © William El Kaim 2016 15
  16. 16. Ingestion Technologies Apache Spark Source: Ippon Source: Databricks Copyright © William El Kaim 2016 16
  17. 17. Ingestion Technologies Apache Spark • Spark Core • General execution engine for the Spark platform • In-memory computing capabilities deliver speed • General execution model supports wide variety of use cases • Spark Streaming • Run a streaming computation as a series of very small, deterministic batch jobs • Batch size as low as ½ sec, latency of about 1 sec • Exactly-once semantics • Potential for combining batch and streaming processing in same system Copyright © William El Kaim 2016 17
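A small sketch of the micro-batch model in the Java API (assuming Spark 2.x; the socket source and port are illustrative). Each one-second interval becomes a small, deterministic batch job:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class MicroBatchWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("micro-batch-demo").setMaster("local[2]");
        // Batch interval of one second: the stream is cut into tiny batch jobs
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        ssc.socketTextStream("localhost", 9999)                        // hypothetical source
           .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
           .mapToPair(word -> new Tuple2<>(word, 1))
           .reduceByKey((a, b) -> a + b)
           .print();                                                   // counts per batch

        ssc.start();
        ssc.awaitTermination();
    }
}
```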
  18. 18. Ingestion Technologies Apache Spark • At the core of Apache Spark is the notion of data abstraction as a distributed collection of objects, called a Resilient Distributed Dataset (RDD) • RDDs allow you to write programs that transform these distributed datasets. • RDDs are immutable, recomputable, and fault-tolerant distributed collections of objects (partitions) spread across a cluster of machines • Data can be stored in memory or on (local) disk. • RDDs enable parallel processing on data sets • Data is partitioned across machines in a cluster and can be operated on in parallel with a low-level API that offers transformations and actions. • RDDs are fault tolerant as they track data lineage information to rebuild lost data automatically on failure. • Contains the transformation history (“lineage”) for the whole data set • Operations • Stateless transformations (map, filter, groupBy) • Actions (count, collect, save) Source: Trivadis. Copyright © William El Kaim 2016 18
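To illustrate the lazy transformation/action split described above (a sketch; the HDFS path is hypothetical):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddLineageDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("rdd-demo").setMaster("local[*]"));

        // Transformations are lazy: they only record lineage, nothing executes yet
        JavaRDD<String> lines = sc.textFile("hdfs:///data/app/logs/*"); // hypothetical path
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

        // An action triggers parallel execution across the partitions;
        // lost partitions can be rebuilt from the recorded lineage
        System.out.println("error lines: " + errors.count());
        sc.close();
    }
}
```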
  19. 19. Ingestion Technologies Apache Spark • DataFrame is an immutable distributed collection of data (like an RDD) • Unlike an RDD, data is organized into named columns, like a table in a relational database. • Designed to make processing of large data sets even easier, DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction. • It provides a domain-specific language API to manipulate your distributed data • Makes Spark accessible to a wider audience, beyond specialized data engineers. • Datasets • Introduced in Spark 1.6, the goal of Spark Datasets is to provide an API that allows users to easily express transformations on domain objects, while also providing the performance and benefits of the robust Spark SQL execution engine. • In Spark 2.0, the DataFrame API merges with the Dataset API, unifying data processing capabilities across all libraries. Copyright © William El Kaim 2016 19
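A sketch of the DataFrame API in Java (assuming Spark 2.x, where SparkSession is the entry point; the input file and the "city" column are made up):

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("df-demo").master("local[*]").getOrCreate();

        // Named columns give the engine a structure to optimize, unlike raw RDDs
        Dataset<Row> trips = spark.read().json("trips.json"); // hypothetical input file
        trips.groupBy(col("city")).count().show();            // assumes a "city" field

        spark.stop();
    }
}
```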
  20. 20. Ingestion Technologies Apache Spark Machine Learning • MLlib is Apache Spark's general machine learning library • Allows data scientists to focus on their data problems and models instead of solving the complexities surrounding distributed data (such as infrastructure, configurations, etc.). • The data engineers can focus on distributed systems engineering using Spark's easy-to-use APIs, while the data scientists can leverage the scale and speed of Spark core. • ML Pipelines • Running machine learning algorithms involves executing a sequence of tasks including pre-processing, feature extraction, model fitting, and validation stages. • High-level API for MLlib that lives under the “spark.ml” package. • A pipeline consists of a sequence of stages. There are two basic types of pipeline stages: Transformer and Estimator. • A Transformer takes a dataset as input and produces an augmented dataset as output. • An Estimator must first be fit on the input dataset to produce a model, which is a Transformer that transforms the input dataset. Copyright © William El Kaim 2016 20
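A compact sketch of a spark.ml pipeline in Java, chaining two Transformers into an Estimator (the two-row training set and column names are obviously illustrative):

```java
import java.util.Arrays;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PipelineDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ml-pipeline").master("local[*]").getOrCreate();

        // Tiny illustrative training set with "text" and "label" columns
        StructType schema = new StructType()
                .add("text", DataTypes.StringType)
                .add("label", DataTypes.DoubleType);
        Dataset<Row> training = spark.createDataFrame(Arrays.asList(
                RowFactory.create("spark is fast", 1.0),
                RowFactory.create("hadoop batch job", 0.0)), schema);

        // Two Transformers reshape the data; the Estimator is fit last
        Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
        HashingTF tf = new HashingTF().setInputCol("words").setOutputCol("features");
        LogisticRegression lr = new LogisticRegression().setMaxIter(10);

        Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{tokenizer, tf, lr});
        PipelineModel model = pipeline.fit(training);   // the fitted model is itself a Transformer
        model.transform(training).select("text", "prediction").show();
        spark.stop();
    }
}
```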
  21. 21. Ingestion Technologies Google Cloud Dataflow • Fully-managed cloud service and programming model for batch and streaming big data processing. • Used for developing and executing a wide range of data processing patterns including ETL, batch computation, and continuous computation. • Cloud Dataflow “frees” developers from operational tasks like resource management and performance optimization. • The open source Java-based Cloud Dataflow SDK enables developers to implement custom extensions and to extend Dataflow to alternate service environments. Source: Google. Copyright © William El Kaim 2016 21
  22. 22. Ingestion Technologies Google Dataflow vs. Spark • Dataflow is clearly faster than Spark. • But Spark has an ace up its sleeve in the form of REPL, or its “read evaluate print loop” functionality, which enables users to iterate on their problems quickly and easily. • “If you have a bunch of data scientists and you’re trying to figure out what they want to do, and they need to play around a lot, then Spark may be a better solution for those sorts of cases,” Oliver says. • While Spark maintains an edge among data scientists looking to iterate quickly, Google Cloud Dataflow seems to hold the advantage in the operations department, thanks to all the work that Google has done over the years to optimize queries at scale. • “Google Cloud Dataflow has some key advantages, in particular if you have a well thought out process that you’re trying to implement, and you’re trying to do it cost effectively…then Google Cloud Dataflow is an excellent option for doing it at scale and at a lower cost,” Oliver says. Source: Datanami. Copyright © William El Kaim 2016 22
  23. 23. Ingestion Technologies Apache Beam • Apache Beam is an open source, unified programming model used to create a data processing pipeline. • Start by building a program that defines the pipeline using one of the open source Beam SDKs. • The pipeline is then executed by one of Beam's supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow. • Beam is particularly useful for embarrassingly parallel data processing tasks, in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel. • Beam can also be used for Extract, Transform, and Load (ETL) tasks and pure data integration. Copyright © William El Kaim 2016 23
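A minimal Beam pipeline sketch (assuming a Beam 2.x Java SDK; the input and output paths are illustrative). The same code runs on Flink, Spark, or Dataflow depending on the runner passed in the options:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;

public class BeamWordCount {
    public static void main(String[] args) {
        // The back-end (Flink, Spark, Dataflow, or the local DirectRunner)
        // is selected via the options, e.g. --runner=FlinkRunner
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.read().from("input.txt"))        // hypothetical input
         .apply(Count.perElement())                     // embarrassingly parallel count
         .apply(MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
             @Override
             public String apply(KV<String, Long> kv) {
                 return kv.getKey() + ": " + kv.getValue();
             }
         }))
         .apply(TextIO.write().to("counts"));           // hypothetical output prefix

        p.run().waitUntilFinish();
    }
}
```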
  24. 24. Ingestion Technologies Concord http://concord.io/ Copyright © William El Kaim 2016 24
  25. 25. Ingestion Technologies Apache Flink • Flink's core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. • Flink includes several APIs for creating applications that use the Flink engine: • DataStream API for unbounded streams, embedded in Java and Scala • DataSet API for static data, embedded in Java, Scala, and Python • Table API with a SQL-like expression language, embedded in Java and Scala • Flink also bundles libraries for domain-specific use cases: • CEP, a complex event processing library • Machine Learning library • Gelly, a graph processing API and library Copyright © William El Kaim 2016 25
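A sketch of the DataStream API (the socket source and port are illustrative); unlike micro-batching, each record flows through the operators as it arrives:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // One-at-a-time streaming over an unbounded source
        env.socketTextStream("localhost", 9999)          // hypothetical source
           .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                   for (String word : line.split(" ")) {
                       out.collect(Tuple2.of(word, 1));
                   }
               }
           })
           .keyBy(0)   // key by the word field
           .sum(1)     // running count per word
           .print();

        env.execute("flink-word-count");
    }
}
```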
  26. 26. Ingestion Technologies Apache Flink Source: Ippon Source: Apache Copyright © William El Kaim 2016 26
  27. 27. Ingestion Technologies Apache Flink Commercial Support: Data Artisans. Copyright © William El Kaim 2016 27
  28. 28. Ingestion Technologies Spark vs. Flink • Flink is: • optimized for cyclic or iterative processes, using iterative transformations on collections. • This is achieved by an optimization of join algorithms, operator chaining, and reuse of partitioning and sorting. • However, Flink is also a strong tool for batch processing. • Spark is: • based on resilient distributed datasets (RDDs). • This (mostly) in-memory data structure powers Spark's functional programming paradigm. It is capable of big batch calculations by pinning data in memory. Source: Zalando Source: Quora Copyright © William El Kaim 2016 28
  29. 29. Ingestion Technologies Apache Apex • Apache Apex is a YARN-native integrated platform that unifies stream and batch processing. • It processes big data in motion in a way that is highly scalable, highly performant, fault tolerant, stateful, secure, and distributed. • Github • Comparisons to others • Spark and Storm are considered difficult to use. They're built on batch engines, rather than a true streaming architecture, and don't natively support stateful computation. • They can't do the low-latency processing that Apex and Flink can, and will suffer a latency overhead for having to schedule batches repeatedly, no matter how quickly that occurs. • Use cases • GE's Predix IoT cloud platform uses Apex for industrial data and analytics • Capital One uses it for real-time decisions and fraud detection. Source: ASF. Copyright © William El Kaim 2016 29
  30. 30. Ingestion Technologies • Apache Samza • Samza is a distributed stream-processing framework that is based on Apache Kafka and YARN. • It provides a simple callback-based API that’s similar to MapReduce, and it includes snapshot management and fault tolerance in a durable and scalable way. • Amazon Kinesis • Kinesis is Amazon’s service for real-time processing of streaming data on the cloud. • Deeply integrated with other Amazon services via connectors, such as S3, Redshift, and DynamoDB, for a complete Big Data architecture. Copyright © William El Kaim 2016 30
  31. 31. Ingestion Technologies • NFS Gateway • The NFS Gateway supports NFSv3 and allows HDFS to be mounted as part of the client's local file system. • Apache Sqoop • Tool designed for efficiently transferring bulk data between Hadoop and structured data stores, in both directions (see the command sketch below). • Import data from external structured data stores into HDFS • Extract data from Hadoop and export it to external structured data stores like relational databases and enterprise data warehouses. Copyright © William El Kaim 2016 31
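Typical Sqoop invocations look like the following sketch (the connection string, credentials, table names, and HDFS paths are all illustrative):

```sh
# Pull a relational table into HDFS with 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4

# Push aggregated results back out to the warehouse
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table order_stats \
  --export-dir /data/out/order_stats
```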
  32. 32. Ingestion Technologies Sqoop Example Source: Rubén Casado Tejedor. Copyright © William El Kaim 2016 32
  33. 33. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Storage Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 33
  34. 34. Storage Technology Rise of the Immutable Datastore • In a relational database, files are mutable, which means a given cell can be overwritten when there are changes to the data relevant to that cell. • New architectures offer an accumulate-only file system that overwrites nothing. Each file is immutable, and any changes are recorded as separate timestamped files. • The method lends itself not only to faster and more capable stream processing, but also to various kinds of historical time-series analysis. Source: PWC. Copyright © William El Kaim 2016 34
  35. 35. Storage Technology Why is immutability in big data stores significant? • Fewer dependencies & higher-volume data handling and improved site-response capabilities • Immutable files reduce dependencies or resource contention, which means one part of the system doesn't need to wait for another to do its thing. That's a big deal for large, distributed systems that need to scale and evolve quickly. • More flexible reads and faster writes • Writing data without structuring it beforehand means that you can have both fast reads and writes, as well as more flexibility in how you view the data. • Compatibility with Hadoop & log-based messaging protocols • A popular method of distributed storage for less-structured data. • Ex: Apache Samza and Apache Kafka are symbiotic and compatible with the Hadoop Distributed File System (HDFS). • Suitability for auditability and forensics • Log-centric databases and the transactional logs of many traditional databases share a common design approach that stresses consistency and durability (the C and D in ACID). • But only the fully immutable shared log systems preserve the history that is most helpful for audit trails and forensics. Copyright © William El Kaim 2016 35
  36. 36. Storage Technologies: Databases Evolutions Source: PWC. Copyright © William El Kaim 2016 36
  37. 37. Storage Technologies: Cost & Speed Copyright © William El Kaim 2016 37
  38. 38. • HDFS: Distributed FileSystem for Hadoop • A Java-based filesystem that provides scalable and reliable data storage. Designed to span large clusters of commodity servers. • Master-slave architecture (NameNode – DataNodes) • NameNode: Manages the directory tree and regulates access to files by clients • DataNodes: Store the data. Files are split into blocks of the same size, and these blocks are stored and replicated on a set of DataNodes • Apache Hive • An open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. • Abstraction layer on top of MapReduce • SQL-like language called HiveQL. • Metastore: Central repository of Hive metadata. Storage Technologies Copyright © William El Kaim 2016 38
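A small HiveQL sketch of the "structure projected onto files" idea (the table name, columns, and HDFS location are illustrative):

```sql
-- Hypothetical table over files already sitting in HDFS;
-- the schema is projected at read time, the files are untouched
CREATE EXTERNAL TABLE page_views (
  user_id STRING,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- HiveQL queries compile down to MapReduce (or Tez) jobs
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE dt = '2016-12-01'
GROUP BY url;
```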
  39. 39. Storage Technologies • Apache KUDU • Kudu is an innovative new storage engine that is designed from the ground up to overcome the limitations of various storage systems available today in the Hadoop ecosystem. • For the very first time, Kudu enables the use of the same storage engine for large scale batch jobs and complex data processing jobs that require fast random access and updates. • As a result, applications that require both batch as well as real-time data processing capabilities can use Kudu for both types of workloads. • With Kudu’s ability to handle atomic updates, you no longer need to worry about boundary conditions relating to late-arriving or out-of-sequence data. • In fact, data with inconsistencies can be fixed in place in almost real time, without wasting time deleting or refreshing large datasets. • Having one system of record that is capable of handling fast data for both analytics and real- time workloads greatly simplifies application design and implementation. Copyright © William El Kaim 2016 39
  40. 40. Storage Technologies • HBase • An open source, non-relational, distributed column-oriented database written in Java. • Modeled after Google's BigTable and developed as part of the Apache Hadoop project, it runs on top of HDFS. • Random, real-time read/write access to the data. • Very light «schema»; rows are stored in sorted order. • MapR DB • An enterprise-grade, high-performance, in-Hadoop NoSQL database management system, MapR DB is used to add real-time operational analytics capabilities to Hadoop. • Pivotal HDB • Hadoop Native SQL Database powered by Apache HAWQ Copyright © William El Kaim 2016 40
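A sketch of HBase's random read/write API in Java (assuming the HBase 1.x client; the table, column family, and qualifier names are made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Row key + column family + qualifier; rows are kept in sorted key order
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Paris"));
            table.put(put);

            // Point read by row key: this is the random, real-time access HDFS alone lacks
            Result row = table.get(new Get(Bytes.toBytes("user42")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"))));
        }
    }
}
```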
  41. 41. Storage Technologies • Apache Impala • Open source MPP analytic database built to work with data stored on open, shared data platforms like Apache Hadoop’s HDFS filesystem, Apache Kudu’s columnar storage, and object stores like S3. • By being able to query data from multiple sources stored in different, open formats like Apache Parquet, Apache Avro, and text, Impala decouples data and compute and lets users query data without having to move/load data specifically into Impala clusters. • In the cloud, this capability is especially useful as you can create transient clusters with Impala to run your reports/analytics and shut down the cluster when you are done or elastically scale compute power to support peak demands, letting you save on cluster-hosting costs. • Impala is designed to run efficiently on large datasets, and scales to hundreds of nodes and hundreds of users. • You can learn more about the unique use cases Impala on S3 delivers in this blog post. Copyright © William El Kaim 2016 41
  42. 42. Storage Technologies • MemSQL • MemSQL unveiled its “Spark Streamliner” initiative, in which it incorporated Apache Spark Streaming as a middleware component to buffer the parallel flow of data coming from Kafka before it's loaded into MemSQL's consistent storage. • This enabled customers like Pinterest to eliminate batch processing and move to continuous processing of data. • The exactly-once semantics is available through the “Create Pipeline” command in MemSQL version 5.5. • The command will automatically extract data from the Kafka source, perform some type of transformation, and then load it into the MemSQL database's leaf nodes (as opposed to loading them in MemSQL's aggregator nodes first, as it did with Streamliner). • The database can work on multiple, simultaneous streams while adhering to exactly-once semantics. Copyright © William El Kaim 2016 42
  43. 43. Storage Technology Technology Landscape Source: Octo Technology. Copyright © William El Kaim 2016 43
  44. 44. Storage Technology: Encoding Format • A huge bottleneck for HDFS-enabled applications like MapReduce and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. • These issues are exacerbated by the difficulties of managing large datasets, such as evolving schemas or storage constraints. • Choosing an appropriate file format can have some significant benefits: • Faster read times • Faster write times • Splittable files (so you don't need to read the whole file, just a part of it) • Schema evolution support (allowing you to change the fields in a dataset) • Advanced compression support (compress the files with a compression codec without sacrificing these features) Copyright © William El Kaim 2016 44. Source: Matthew Rathbone
  45. 45. Storage Technology: Encoding Format • The format of the files you can store on HDFS, like any filesystem, is entirely up to you. • However, unlike a regular file system, HDFS is best used in conjunction with a data processing toolchain like MapReduce or Spark. • These processing systems typically (although not always) operate on some form of textual data like webpage content, server logs, or location data. • If you're just getting started with Hadoop, HDFS, and Hive and wondering what file format you should be using to begin with, then use tab-delimited files for your prototyping (and first production jobs). • They're easy to debug (because you can read them), they are the default format of Hive, and they're easy to create and reason about. • Once you have a production MapReduce or Spark job regularly generating data, come back and pick something better. Copyright © William El Kaim 2016 45. Source: Matthew Rathbone
  46. 46. Encoding Format: Text Files (e.g. CSV, TSV) • Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical Unix fashion. • Text files are inherently splittable (just split on \n characters!), but if you want to compress them you'll have to use a file-level compression codec that supports splitting, such as BZIP2. • Because these files are just text files you can encode anything you like in a line of the file. • One common example is to make each line a JSON document to add some structure. While this can waste space with needless column headers, it is a simple way to start using structured data in HDFS. Copyright © William El Kaim 2016 46. Source: Matthew Rathbone
  47. 47. Encoding Format: Sequence Files • Sequence files were originally designed for MapReduce. • They encode a key and a value for each record and nothing more. • Records are stored in a binary format that is smaller than a text-based format would be. • Like text files, the format does not encode the structure of the keys and values, so if you make schema migrations they must be additive. • Sequence files by default use Hadoop's Writable interface in order to figure out how to serialize and de-serialize classes to the file. • Typically if you need to store complex data in a sequence file you do so in the value part while encoding the id in the key. The problem with this is that if you add or change fields in your Writable class it will not be backwards compatible with the data stored in the sequence file. • One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. • Sequence files are well supported across Hadoop and many other HDFS-enabled projects, and I think represent the easiest next step away from text files. • More: Apache Hadoop SequenceFile wiki Copyright © William El Kaim 2016 47. Source: Matthew Rathbone
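A short sketch of writing a block-compressed sequence file with the Hadoop 2.x API (the output path and key/value content are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SeqFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // BLOCK compression keeps the file splittable even with a non-splittable codec
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/tmp/events.seq")),     // hypothetical path
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK))) {
            // Key/value pairs serialized via the Writable interface
            writer.append(new IntWritable(1), new Text("first event"));
        }
    }
}
```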
  48. 48. Encoding Format: AVRO • Avro is not really a file format, it's a file format plus a serialization and deserialization framework. • Encodes the schema of its contents directly in the file, which allows you to store complex objects natively. • Avro provides: • Rich data structures. • A compact, fast, binary data format. • A container file, to store persistent data. • Remote procedure call (RPC). • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages. • Avro defines file data schemas in JSON (for interoperability), allows for schema evolutions (remove a column, add a column), and multiple serialization/deserialization use cases. • Avro supports block-level compression. • For most Hadoop-based use cases Avro is a really good choice. • More: Apache Avro web site Copyright © William El Kaim 2016 48. Source: Matthew Rathbone
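An illustrative Avro schema (a plain-JSON .avsc file; the record and field names are made up). The nullable field with a default shows the additive schema evolution mentioned above: adding such a field keeps old readers and writers compatible:

```json
{
  "type": "record",
  "name": "Event",
  "namespace": "com.example",
  "fields": [
    {"name": "id",  "type": "long"},
    {"name": "url", "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
```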
  49. 49. Encoding Format: Columnar File Formats • The latest evolution concerning file formats for Hadoop is columnar file storage. • Basically this means that instead of just storing rows of data adjacent to one another you also store column values adjacent to each other. • So datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing framework just needs access to a subset of data that is stored on disk, as it can access all values of a single column very quickly without reading whole records. • One huge benefit of column-oriented file formats is that data in the same column tends to be compressed together, which can yield some massive storage optimizations (as data in the same column tends to be similar). • If you're chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of your application • If you have an application that usually needs entire rows of data then the columnar formats may actually be a detriment to performance due to the increased network activity required. • Overall these formats can drastically optimize workloads, especially for Hive and Spark, which tend to just read segments of records rather than the whole thing (which is more common in MapReduce). • Two file formats: • Apache Parquet, which seems to have the most community support. • Apache ORC, an evolution of RCFile. Copyright © William El Kaim 2016 49. Source: Matthew Rathbone
  50. 50. Storage Technology: Encoding Format Copyright © William El Kaim 2016 50
  51. 51. Two Ways To Compress Data In Hadoop • File-Level Compression • Compress entire files regardless of the file format, the same way you would compress a file in Linux. Some of these formats are splittable (e.g. bzip2, or LZO if indexed). • Block-Level Compression • Internal to the file format, so individual blocks of data within the file are compressed. • This means that the file remains splittable even if you use a non-splittable compression codec. • Snappy is a great balance of speed and compression ratio. Copyright © William El Kaim 2016 51. Source: Matthew Rathbone
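As a sketch, block-level Snappy compression for a MapReduce job's sequence-file output can be configured like this (assuming the org.apache.hadoop.mapreduce APIs; the helper method name is our own):

```java
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressionConfig {
    // Sketch: write block-compressed sequence files with Snappy from a MapReduce job.
    // The output stays splittable because compression is applied per block, not per file.
    public static void enableBlockCompression(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    }
}
```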
  52. 52. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Processing Paradigms & Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 52
  53. 53. Hadoop Processing Paradigms Batch processing • Large amounts of static data • Generally incurs high latency / Volume Real-time processing • Compute streaming data • Low latency • Velocity Hybrid computation • Lambda Architecture • Volume + Velocity Source: Rubén Casado & Cloudera. Copyright © William El Kaim 2016 53
  54. 54. Hadoop Processing Paradigms & Time Copyright © William El Kaim 2016 54
  55. 55. Hadoop Batch processing • Scalable • Large amount of static data • Distributed • Parallel • Fault tolerant • High latency Volume Source: Rubén Casado. Copyright © William El Kaim 2016 55
  56. 56. • MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. • Key Terminology • Job: A “full program” - an execution of a Mapper and Reducer across a data set • Task: An execution of a Mapper or a Reducer on a slice of data – a.k.a. Task-In-Progress (TIP) • Task Attempt: A particular instance of an attempt to execute a task on a machine Hadoop – Batch Processing - Map Reduce Copyright © William El Kaim 2016 56
  57. 57. Source: Hadooper. Copyright © William El Kaim 2016 57
  58. 58. • Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). • MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to reduce the distance over which it must be transmitted. • "Map" step • Each worker node applies the "map()" function to the local data, and writes the output to a temporary storage. • A master node ensures that only one copy of redundant input data is processed. • "Shuffle" step • Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node. • "Reduce" step • Worker nodes now process each group of output data, per key, in parallel. Hadoop – Batch Processing - Map Reduce Copyright © William El Kaim 2016 58
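The canonical word-count example maps directly onto these three steps (a sketch using the org.apache.hadoop.mapreduce API; the job driver is omitted):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map emits (word, 1); the shuffle groups all pairs by word; reduce sums per word
public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // "Map" step: runs on the worker holding the data block (data locality)
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            // "Reduce" step: after the shuffle, all values for one key land here together
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}
```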
  59. 59. Hadoop – Batch Processing - Map Reduce Copyright © William El Kaim 2016 59
  60. 60. Batch Processing Technologies Source: Rubén Casado. Copyright © William El Kaim 2016 60
  61. 61. Batch Processing Architecture Example Source: Helena Edelson. Copyright © William El Kaim 2016 61
  62. 62. Real-time Processing • Low latency • Continuous unbounded streams of data • Distributed • Parallel • Fault-tolerant Velocity Source: Rubén Casado. Copyright © William El Kaim 2016 62
  63. 63. Real-time Processing Technologies Source: Rubén Casado. Copyright © William El Kaim 2016 63
  64. 64. • Computational model and infrastructure for continuous data processing, with the ability to produce low-latency results • Data collected continuously is naturally processed continuously (Event Processing or Complex Event Processing - CEP) • Stream processing and real-time analytics are increasingly becoming where the action is in the big data space. • As real-time streaming architectures like Kafka continue to gain steam, companies that are building next-generation applications upon them will debate the merits of the unified and the federated approaches Real-time (Stream) Processing Source: Trivadis. Copyright © William El Kaim 2016 64
  65. 65. Real-time (Stream) Processing Source: Trivadis. Copyright © William El Kaim 2016 65
  66. 66. Real-time (Stream) Processing Arch. Pattern Source: Cloudera. Copyright © William El Kaim 2016 66
  67. 67. Real-time (Stream) Processing • (Event-) Stream Processing • A one-at-a-time processing model • A datum is processed as it arrives • Sub-second latency • Difficult to process stateful data efficiently • Micro-Batching • A special case of batch processing with very small (tiny) batch sizes • A nice mix between batching and streaming • At the cost of latency • Gives stateful computation, making windowing an easy task Source: Trivadis. Copyright © William El Kaim 2016 67
  68. 68. Hybrid Computation Model • Low latency • Massive data + Streaming data • Scalable • Combine batch and real-time results Volume Velocity Source: Rubén Casado. Copyright © William El Kaim 2016 68
  69. 69. Hybrid Computation: Lambda Architecture • Data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. • A system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries. • This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. • The two view outputs may be joined before presentation. • Lambda Architecture case stories via lambda-architecture.net Source: Kreps. Copyright © William El Kaim 2016 69
  70. 70. • Batch layer • Receives arriving data, combines it with historical data, and recomputes results by iterating over the entire combined data set. • The batch layer has two major tasks: managing historical data, and recomputing results such as machine learning models. • Operates on the full data and thus allows the system to produce the most accurate results. However, the results come at the cost of high latency due to high computation time. • The speed layer • Is used in order to provide results in a low-latency, near real-time fashion. • Receives the arriving data and performs incremental updates to the batch layer results. • Thanks to the incremental algorithms implemented at the speed layer, computation cost is significantly reduced. • The serving layer enables various queries of the results sent from the batch and speed layers. Hybrid Computation: Lambda Architecture Source: Kreps. Copyright © William El Kaim 2016 70
  71. 71. Hybrid computation: Lambda Architecture Source: MapR. Copyright © William El Kaim 2016 71
  72. 72. Hybrid computation: Lambda Architecture DataTorrent • DataTorrent RTS Core • Open source enterprise-grade unified stream and batch platform • High-performing, fault-tolerant, scalable, Hadoop-native in-memory platform • Supports Kafka, HDFS, AWS S3n, NFS, (s)FTP, JMS • dtManage - DataTorrent Management Console • Hadoop-integrated application that provides an intuitive graphical management interface for DevOps teams • Manage, monitor, update, and troubleshoot the DataTorrent RTS system and applications Source: DataTorrent. Copyright © William El Kaim 2016 72
  73. 73. Ex: Novelti.io (ex. Lambdoop) Batch Real-Time Hybrid Source: Novelti.io. Copyright © William El Kaim 2016 73
  74. 74. Ex: Lambda Architecture Source: Datastax. Copyright © William El Kaim 2016 74
  75. 75. Ex: Lambda Architecture Stacks Source: Helena Edelson. Copyright © William El Kaim 2016 75
  76. 76. Different Streaming Architecture Vision • The major Hadoop distributors have different views on how streaming fits into traditional Hadoop architectures. • Hortonworks has taken a data plane approach (with HDP) • that seeks to virtually connect multiple data repositories in a federated manner • to unify the security and governance of data existing in different places (on- and off-premise data lakes like HDP and streaming data platforms like HDF). • Specifically, it's building hooks between Apache Atlas (the data governance component) and Apache Knox (the security tool) that give customers a single view of their data. • MapR is going all-in on the converged approach that stresses the importance of a single unified data repository. • Cloudera, meanwhile, sits somewhere in the middle (although it's probably closer to MapR). Source: Datanami. Copyright © William El Kaim 2016 76
  77. 77. Ex: Lambda Architecture Cloudera Vision • Kafka as the piece of a larger real-time or near real-time architecture • Combination of Kafka and Spark Streaming for the so-called speed layer. • In conjunction with a batch layer, leading to the use of the lambda architecture • Because people want to operate with a larger history of events • Kudu project as the real optimized store for Lambda architectures because • Kudu offers a happy medium between the scan performance of HDFS and the record-level updating capability of HBase. • It enables real-time response to single events and can be the speed layer and batch layer for a single store Source: Datanami. Copyright © William El Kaim 2016 77
  78. 78. Hybrid computation: Kappa Architecture • Proposal from Jay Kreps (LinkedIn) in this article. • Followed by the talk “Turning the database inside out with Apache Samza” by Martin Kleppmann • Main objective • Avoid maintaining two separate code bases for the batch and speed layers (lambda). • Key benefits • Handle both real-time data processing and continuous data reprocessing using a single stream processing engine. • Data reprocessing is an important requirement for making visible the effects of code changes on the results. Source: Kreps. Copyright © William El Kaim 2016 78
  79. 79. Hybrid computation: Kappa Architecture • Architecture is composed of only two layers: • The stream processing layer runs the stream processing jobs. • Normally, a single stream processing job is run to enable real-time data processing. • Data reprocessing is only done when some code of the stream processing job needs to be modified. • This is achieved by running another, modified stream processing job and replaying all previous data. • The serving layer is used to query the results (like the Lambda architecture). Source: O’Reilly. Copyright © William El Kaim 2016 79
  80. 80. Hybrid computation: Kappa Architecture • Intrinsically, there are four main principles in the Kappa architecture: • Everything is a stream: Batch operations become a subset of streaming operations. Hence, everything can be treated as a stream. • Immutable data sources: Raw data (the data source) is persisted and views are derived, but a state can always be recomputed as the initial record is never changed. • Single analytics framework: Keep it short and simple (KISS) principle. A single analytics engine is required. Code, maintenance, and upgrades are considerably reduced. • Replay functionality: Computations and results can evolve by replaying the historical data from a stream. • The data pipeline must guarantee that events stay in order from generation to ingestion. This is critical to guarantee consistency of results, as it guarantees deterministic computation: running the same data twice through a computation must produce the same result. Source: MapR. Copyright © William El Kaim 2016 80
  81. 81. Hybrid computation: Kappa Architecture • Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. • For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days. • When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table. • When the second job has caught up, switch the application to read from the new table. • Stop the old version of the job, and delete the old output table. Source: Kreps. Copyright © William El Kaim 2016 81
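In Kafka terms, step one of this recipe is just topic configuration (a sketch; the topic name, partition count, and ZooKeeper address are illustrative, and 2,592,000,000 ms is 30 days):

```sh
# Retain 30 days of the source topic so the stream can be fully replayed
kafka-topics.sh --create --zookeeper zk:2181 \
  --topic events --partitions 8 --replication-factor 3 \
  --config retention.ms=2592000000
```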
  82. 82. Kappa Architecture Example Source: Trivadis. Copyright © William El Kaim 2016 82
  83. 83. Hybrid computation: Lambda vs. Kappa • Lambda: used to derive value from all data in a single processing chain • Kappa: used to provide the freshest data to customers Source: Kreps. Copyright © William El Kaim 2016 83
  84. 84. Hybrid computation: Lambda vs. Kappa Lambda Kappa Source: Ericsson Copyright © William El Kaim 2016 84
  85. 85. Hadoop Processing Paradigms Evolutions Source: Rubén Casado Tejedor Copyright © William El Kaim 2016 85
  86. 86. • MapReduce • A programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. • Apache Hive • Provides a mechanism to project structure onto large data sets and to query the data using a SQL-like language called HiveQL. • Apache Spark • An open-source engine developed specifically for handling large-scale data processing and analytics. • Apache Storm • A system for processing streaming data in real time that adds reliable real-time data processing capabilities to Enterprise Hadoop. Processing Technologies Copyright © William El Kaim 2016 86
  87. 87. Processing Technologies • Apache Drill • Called the Omni-SQL: Schema-free SQL query engine for Hadoop, NoSQL, and cloud storage • An open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets • Apache Pig • Platform for analyzing large data sets • High-level procedural language for expressing data analysis programs. • Pig Latin: Data flow programming language. • Cascading • Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows • Eases development of complex Hadoop MapReduce workflows, in the same way as Pig Source: Dataiku Copyright © William El Kaim 2016 87
  88. 88. Processing Technologies • Apache Drill is an engine that can connect to many different data sources and provide a SQL interface to them. • Standard data sources that you'd be able to query with SQL, like Oracle or MySQL • Can also work with flat files such as CSV or JSON • As well as Avro and Parquet formats. • Its capability to run SQL against files is a great feature. • Example of how to use Drill here. • Apache OMID • Contributed to the ASF by Yahoo • Omid provides a high-performance ACID transactional framework with Snapshot Isolation guarantees on top of HBase, able to scale to thousands of clients triggering transactions on application data. • It's one of the few open-source transactional frameworks that can scale beyond 100K transactions per second on mid-range hardware while incurring minimal impact on the latency of accessing the datastore. Copyright © William El Kaim 2016 88
  89. 89. Stream Processing: HortonWorks Dataflow • Hortonworks DataFlow is an integrated platform to collect, conduct and curate real-time data, moving it from any source to any destination. Source: HortonWorks. Copyright © William El Kaim 2016 89
  90. 90. Stream Processing: HortonWorks Dataflow Source: HortonWorks. Copyright © William El Kaim 2016 90
  91. 91. Streaming PaaS: StreamTools http://blog.nytlabs.com/streamtools/ Copyright © William El Kaim 2016 91
  92. 92. Streaming PaaS: Striim http://www.striim.com/ Copyright © William El Kaim 2016 92
  93. 93. Streaming PaaS: StreamAnalytix http://streamanalytix.com/ Copyright © William El Kaim 2016 93
  94. 94. Streaming PaaS: InsightEdge http://insightedge.io/ Copyright © William El Kaim 2016 94
  95. 95. More Information • The Hadoop Ecosystem Table • Big Data Ingestion and Streaming Tools • Apache Storm vs. Spark Streaming • Data Science & Data Discovery Platforms Compared. Datameer and Dataiku DSS go head to head • Applying the Kappa architecture in the telco industry Copyright © William El Kaim 2016 95
  96. 96. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Big Data Fabric • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 96
  97. 97. Big Data Fabric Introduction • Definition: • Bringing together disparate big data sources automatically, intelligently, and securely, and processing them in a big data platform technology, such as Hadoop and Apache Spark, to deliver a unified, trusted, and comprehensive view of customer and business data. • Big data fabric focuses on automating the process of ingesting, curating, and integrating big data sources to deliver intelligent insights that are critical for businesses to succeed. • The platform minimizes complexity by automating processes, generating big data technology and platform code automatically, and integrating workflows to simplify the deployment. • Big data fabric is not just about Hadoop or Spark — it comprises several components, all of which must work in tandem to deliver a flexible, integrated, secure, and scalable platform. • Big data fabric architecture has six core layers Source: Forrester. Copyright © William El Kaim 2016 97
  98. 98. Big Data Fabric Architecture Source: Eckerson Group Source: Forrester Copyright © William El Kaim 2016 98
  99. 99. Big Data Fabric Six core Architecture Layers • Data ingestion layer. • The data ingestion layer deals with getting the big data sources connected, ingested, streamed, and moved into the data fabric. • Big data can come from devices, sensors, logs, clickstreams, databases, applications, and various cloud sources, in the form of structured or unstructured data. • Processing and persistence layer. • This layer uses Hadoop, Spark, and other Hadoop ecosystem components such as Kafka, Flume, and Hive to process and persist big data for use within the big data fabric framework. • Orchestration layer. • The orchestration layer is a critical layer of the big data fabric that transforms, integrates, and cleans data to support various use cases in real time or near real time. • It can transform data inside Hadoop to enable integration, or it can match and clean data dynamically. • Data discovery layer. • This layer automates the discovery of new internal or external big data sources and presents them as a new data asset for consumption by business users. • Dynamic discovery includes several components such as data modeling, data preparation, curation, and virtualization to deliver a flexible big data platform to support any use case. • Data management and intelligence layer. • This layer enables end-to-end data management capabilities that are essential to ensuring the reliability, security, integration, and governance of data. • Its components include data security, governance, metadata management, search, data quality, and lineage. • Data access layer. • This layer includes caching and in-memory technologies, self-service capabilities and interactions, and fabric components that can be embedded in analytical solutions, tools, and dashboards. Source: Forrester. Copyright © William El Kaim 2016 99
  100. 100. Big Data Fabric Adoption Is In Its Infancy • Most enterprises that have a big data fabric platform are building it themselves by integrating various core open source technologies • In addition, they are supporting the platform with commercial products for data integration, security, governance, machine learning, SQL-on-Hadoop, and data preparation technologies. • However, organizations are realizing that creating a custom technology stack to support a big data fabric implementation (and then customizing it to meet business requirements) requires significant time and effort. • Solutions are starting to emerge from vendors. Source: Forrester. Copyright © William El Kaim 2016 100
  101. 101. Make Big Data Fabric Part Of Your Big Data Strategy! • Enterprise architects whose companies are pursuing a big data strategy can benefit from a big data fabric implementation that automates, secures, integrates, and curates big data sources intelligently. • Your big data fabric strategy should: • Integrate only a few big data sources at first. • Start top-down rather than bottom-up, keeping the end in mind. • Separate analytics from data management. Analytics tools should focus primarily on data visualization and advanced statistical/data mining algorithms with limited dependence on data management functions. Decoupling data management from data analytics reduces the time and effort needed to deliver trusted analytics. • Create a team of experts to ensure success. • Use automation and machine learning to accelerate deployment. Source: Forrester. Copyright © William El Kaim 2016 101
  102. 102. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Geo-Spatial-on-Hadoop • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 102
  103. 103. Geo-Spatial-on-Hadoop • ESRI • ESRI for Big Data • Esri GIS tools for Hadoop: Toolkit allowing developers to build analytical tools leveraging both Hadoop and ArcGIS. • Esri User Defined Functions built on top of the Esri Geometry API • Pigeon: spatial extension to Pig that allows it to process spatial data. • Hive Spatial Query: adds geometric user-defined functions (UDFs) to Hive. • Geomesa • GeoMesa is an open-source, distributed, spatio-temporal database built on Accumulo, HBase, Cassandra, and Kafka. • SpatialHadoop • Open source MapReduce extension designed specifically to handle huge datasets of spatial data on Apache Hadoop. • SpatialHadoop ships with a built-in spatial high-level language, spatial data types, spatial indexes, and efficient spatial operations. Copyright © William El Kaim 2016 103
  104. 104. Geo-Spatial-on-Hadoop • GeoDataViz • CartoDB • Deep Insights technology is capable of handling and visualizing massive amounts of contextual and time-based location data. • Spatialytics • Standard geoBI platform • mapD • Leverages GPUs and a dedicated NoSQL database for better performance • deck.gl (Uber) • WebGL-powered framework for visual exploratory data analysis of large datasets. • Data Converter • ESRI GeoJSON Utils • GDAL: Geospatial Data Abstraction Library • Redis • Open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, and geospatial indexes with radius queries. • Tutorial / Examples • How To Analyze Geolocation Data with Hive and Hadoop – Uber trips • Geo spatial data support for Hive using Taxi data in NYC • ESRI Wiki Copyright © William El Kaim 2016 104
  105. 105. Deck.GL http://uber.github.io/deck.gl/#/ Copyright © William El Kaim 2016 105
  106. 106. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 106
  107. 107. Hadoop: Open Source Bazaar-Style Dev. • Hadoop was first conceived at Yahoo as a distributed file system (HDFS) and a processing framework (MapReduce) for indexing the Internet. • It worked so well that other Internet firms in Silicon Valley started using the open source software too. • Apache Hadoop, by all accounts, has been a huge success on the open source front. • The Hadoop project has spawned dozens of Apache projects • Hive, Impala, Spark, HBase, Cassandra, Pig, Tez, etc. Copyright © William El Kaim 2016 107
  108. 108. Is there a Hadoop Standard? Apache Software Foundation Hadoop • The Apache Software Foundation (ASF) manages Apache Hadoop • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. • Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. • http://hadoop.apache.org/ Source: Apache Software Foundation Copyright © William El Kaim 2016 108
  109. 109. Is there a Hadoop Standard? Open Data Platform Initiative • ODPi defines itself as "a shared industry effort focused on promoting and advancing the state of Apache Hadoop and big data technologies for the enterprise." • The group has grown its membership steadily since launching in February 2015 under the name Open Data Platform Alliance: • Ampool, Altiscale, Capgemini, DataTorrent, EMC, GE, Hortonworks, IBM, Infosys, Linaro, NEC, Pivotal, PLDT, SAS Institute Inc, Splunk, Squid Solutions, SyncSort, Telstra, Teradata, Toshiba, UNIFi, Verizon, VMware, WANdisco, Xiilab, zData and Zettaset. • ODPi takes a major step forward by securing official endorsement by the Linux Foundation, turning it into a Linux Foundation collaborative project. • Major companies against ODPi are Amazon, Cloudera, and MapR • Specifications • ODPi runtime specification (March 2016) and ODPi Operations Source: ODPi Copyright © William El Kaim 2016 109
  110. 110. Is there a Hadoop Standard? Open Data Platform Initiative • Objectives are: • Reinforces the role of the Apache Software Foundation (ASF) in the development and governance of upstream projects. • Accelerates the delivery of Big Data solutions by providing a well-defined core platform to target. • Defines, integrates, tests, and certifies a standard "ODPi Core" of compatible versions of select Big Data open source projects. • Provides a stable base against which Big Data solution providers can qualify solutions. • Produces a set of tools and methods that enable members to create and test differentiated offerings based on the ODPi Core. • Contributes to ASF projects in accordance with ASF processes and Intellectual Property guidelines. • Supports community development and outreach activities that accelerate the rollout of modern data architectures that leverage Apache Hadoop®. • Will help minimize the fragmentation and duplication of effort within the industry. Source: ODPi Copyright © William El Kaim 2016 110
  111. 111. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop V1 • Hadoop Architecture Examples Copyright © William El Kaim 2016 111
  112. 112. Hadoop V1: Integration Options [diagram] Two integration paths between existing infrastructure (databases & warehouses, applications & spreadsheets, visualization & intelligence tools, logs & files) and Hadoop (HDFS, MapReduce, Pig, Hive, HBase, HCatalog): batch & scheduled integration via data integration tools (Talend, Informatica), ODBC/JDBC, and Sqoop; near real-time integration via REST, WebHDFS, and Flume. Source: HortonWorks Copyright © William El Kaim 2016 112
  113. 113. http://hadooper.blogspot.fr/ Copyright © William El Kaim 2016 113
  114. 114. • Hive - A data warehouse infrastructure that runs on top of Hadoop. Hive supports SQL queries, star schemas, partitioning, join optimizations, caching of data, etc. • Pig - A scripting language for processing Hadoop data in parallel. • MapReduce - Java applications that can process data in parallel. • Ambari - An open source management interface for installing, monitoring and managing a Hadoop cluster. Ambari has also been selected as the management interface for OpenStack. Hadoop V1: Technology Elements • HBase - A NoSQL columnar database providing extremely fast scanning of column data for analytics. • Sqoop, Flume - Tools providing large-scale data ingestion for Hadoop using SQL, streaming and REST API interfaces. • Oozie - A workflow manager and scheduler. • Zookeeper - A coordination infrastructure • Mahout - A machine learning library supporting Recommendation, Clustering, Classification and Frequent Itemset mining. • Hue - A Web interface that contains a file browser for HDFS, a Job Browser for YARN, an HBase Browser, Query Editors for Hive, Pig and Sqoop, and a Zookeeper browser. Copyright © William El Kaim 2016 114
  115. 115. Hadoop V1: Technology Elements Source: Octo Technology Copyright © William El Kaim 2016 115
116. 116. Hadoop V1 Issues • Availability • Hadoop 1.0 had a single point of failure: the Job Tracker. If the Job Tracker failed, all running jobs had to restart. • Scalability • The Job Tracker runs on a single machine performing various tasks such as monitoring, job scheduling, task scheduling and resource management. • Despite the presence of many machines (Data Nodes), they were not utilized efficiently, which limited the scalability of the system. • Multi-Tenancy • The major issue with Hadoop MapReduce that paved the way for Hadoop YARN was multi-tenancy. As Hadoop clusters grew, there was demand to share them across a wide range of processing models and tenants. • Cascading Failure • With Hadoop MapReduce, clusters of more than 4,000 nodes showed instability. • The most common failure observed was cascading failure, where overloaded nodes or replication traffic flooding the network could degrade the whole cluster. Source: Dezyre Copyright © William El Kaim 2016 116
  117. 117. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop V2 • Hadoop Architecture Examples Copyright © William El Kaim 2016 117
118. 118. Hadoop V2 • Hadoop has progressed from the restricted, batch-oriented MapReduce processing model (Hadoop 1.0) to specialized and interactive processing models (Hadoop 2.0). • Hadoop 2.0, popularly known as YARN (Yet Another Resource Negotiator), was introduced in October 2013 Source: HortonWorks Copyright © William El Kaim 2016 118
119. 119. Hadoop V2 • Apache™ Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks. • By eliminating unnecessary tasks, synchronization barriers, and reads from and writes to HDFS, Tez speeds up data processing across both small-scale, low-latency and large-scale, high-throughput workloads. • Apache™ Slider is an engine that runs other applications in a YARN environment. • With Slider, distributed applications that aren’t YARN-aware can now participate in the YARN ecosystem – usually with no code modification. • Slider allows applications to use Hadoop’s data and processing resources, as well as the security, governance, and operations capabilities of enterprise Hadoop. Copyright © William El Kaim 2016 119
120. 120. Hadoop V2: YARN • YARN (Yet Another Resource Negotiator) is • the foundation for parallel processing in Hadoop. • Scalable to 10,000+ data node systems. • Supports different types of workloads such as batch, real-time queries (Tez), streaming, graph processing, in-memory processing, messaging systems, streaming video, etc. You can think of YARN as a highly scalable, parallel-processing operating system that supports all kinds of workloads. • Supports batch processing with high throughput over sequential read scans. • Supports real-time interactive queries with low latency and random reads. Copyright © William El Kaim 2016 120
  121. 121. Hadoop V2: Full Stack Source: HortonWorksCopyright © William El Kaim 2016 121
  122. 122. Hadoop V2 (Another Stack Vision) Copyright © William El Kaim 2016 122
123. 123. Hadoop V2: Spark Advantages • Spark replaces MapReduce. • MapReduce is inefficient at handling iterative algorithms as well as interactive data mining tools. • Spark is fast: it uses memory efficiently, caching working data across operations • Runs programs up to 100x faster than MapReduce in memory, or 10x faster on disk • Spark excels at programming models • involving iterations, interactivity (including streaming) and more. • Spark offers over 80 high-level operators that make it easy to build parallel apps • Spark runs everywhere • Runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. Copyright © William El Kaim 2016 123
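As a small illustration of the in-memory execution model, here is a PySpark sketch (the application name and the HDFS path are invented for the example) that caches a filtered dataset so a second job is served from memory instead of being re-read from disk:

from pyspark import SparkContext

sc = SparkContext(appName="CacheDemo")
lines = sc.textFile("hdfs:///data/events.log")  # hypothetical input file
# cache() keeps the filtered RDD in memory once the first action computes it
errors = lines.filter(lambda l: "ERROR" in l).cache()
print(errors.count())  # first action: reads from HDFS, materializes the cache
print(errors.filter(lambda l: "timeout" in l).count())  # served from memory
sc.stop()

This pattern is what makes iterative algorithms and interactive exploration far cheaper than re-running a full MapReduce pass per query.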
  124. 124. Hadoop V2: Spark Revolution Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. Copyright © William El Kaim 2016 124
125. 125. Hadoop V2: Spark Stack Evolutions (2015) Source: Databricks Goal: a unified engine across data sources, workloads and environments. A DataFrame is a distributed collection of data organized into named columns. An ML pipeline defines a sequence of data pre-processing, feature extraction, model fitting, and validation stages. Copyright © William El Kaim 2016 125
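A minimal PySpark ML pipeline in that spirit (a sketch: the two-row training set and the column names are invented), chaining pre-processing, feature extraction and model fitting into a single fitted object:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("PipelineDemo").getOrCreate()
# Toy labeled data: (id, text, label)
train = spark.createDataFrame(
    [(0, "spark is fast", 1.0), (1, "hadoop map reduce", 0.0)],
    ["id", "text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")   # pre-processing
tf = HashingTF(inputCol="words", outputCol="features")      # feature extraction
lr = LogisticRegression(maxIter=10)                         # model fitting
model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)
model.transform(train).select("id", "prediction").show()
spark.stop()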
126. 126. Hadoop V2: Upcoming Spark V2 • Spark programming originally revolved around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. • As a result, the original Spark core API did not always feel natural for the larger population of data analysts and data engineers, who worked mainly with SQL and statistical languages such as R. • Today, Spark provides higher-level APIs for advanced analytics and data science, and supports five languages: SQL, Scala, Java, Python, and R. • What makes Spark quite special in the distributed computing arena is that different techniques such as SQL queries and machine learning can be mixed and combined together, even within the same script. • By using Spark, data scientists and engineers do not have to switch to different environments and tools for data pre-processing, SQL queries or machine learning algorithms. This boosts the productivity of data professionals and delivers better and simpler data processing solutions. Source: Databricks Copyright © William El Kaim 2016 126
127. 127. Spark V2: What’s New? • Apache Spark Datasets: a high-level, table-like data abstraction. • Datasets feel more natural when reasoning about analytics and machine learning tasks, and can be addressed both via SQL queries and programmatically via APIs. • Programming APIs. • Machine Learning has a big emphasis in this new release: the spark.mllib package is deprecated in favor of the new spark.ml package, which focuses on pipeline-based APIs and is built on DataFrames. • Machine Learning pipelines and models can now be persisted across all languages supported by Spark. • DataFrames and Datasets are now unified for the Scala and Java programming languages under the new Dataset class, which also serves as an abstraction for structured streaming • The new Structured Streaming API aims to allow managing streaming data sets without added complexity. • Performance has also improved with the second-generation Tungsten engine, allowing up to 10 times faster execution. Source: Databricks Copyright © William El Kaim 2016 127
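A minimal Structured Streaming sketch in PySpark (the socket source on localhost:9999 is a stand-in for a real stream such as Kafka), showing that a streaming word count is expressed with the same DataFrame operations as a batch one:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamDemo").getOrCreate()
# Unbounded DataFrame: one row per line arriving on the socket
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()  # continuously updated aggregation
query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()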
  128. 128. Hadoop V1 vs. V2 Source: HortonWorks Video Copyright © William El Kaim 2016 128
129. 129. Hadoop V1 vs. V2 • YARN has taken over the cluster management responsibilities from MapReduce • MapReduce now just takes care of data processing, and the other responsibilities are handled by YARN. Copyright © William El Kaim 2016 129
  130. 130. Hadoop V1 vs. V2: Map Reduce vs. Tez vs. Spark Source: Slim Baltagi Copyright © William El Kaim 2016 130
  131. 131. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop V3 • Hadoop Architecture Examples Copyright © William El Kaim 2016 131
  132. 132. Hadoop V3 • The Apache Hadoop project recently announced its 3.0.0-alpha1 release. • Given the scope of a new major release, the Apache Hadoop community decided to release a series of alpha and beta releases leading up to 3.0.0 GA. • This gives downstream applications and end users an opportunity to test and provide feedback on the changes, which can be incorporated during the alpha and beta process. • The 3.0.0-alpha1 release incorporates thousands of new fixes, improvements, and features since the previous minor release, 2.7.0, which was released over a year ago. • The full changelog and release notes are available on the Hadoop website, but we’d like to drill into the major new changes that landed in 3.0.0-alpha1. Copyright © William El Kaim 2016 132
  133. 133. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Additional Services • Hadoop Architecture Examples Copyright © William El Kaim 2016 133
134. 134. Hadoop Performance Benchmark • TPCx-BB is an Express Benchmark to measure the performance of Hadoop-based Big Data systems. • It measures the performance of both hardware and software components by executing 30 frequently performed analytical queries in the context of retailers with physical and online store presence. • The queries are expressed in SQL for structured data and in machine learning algorithms for semi-structured and unstructured data. • The SQL queries can use Hive or Spark, while the machine learning algorithms use machine learning libraries, user-defined functions, and procedural programs. • The latest TPCx-BB Specification / The benchmark kit • TPC-DS queries, which are largely based on the SQL:2003 specification, are now supported in Spark 2.0 Copyright © William El Kaim 2016 134
135. 135. Developing Applications on Hadoop • Dedicated Application Stack for Hadoop • Cask • Cascading • Crunch • Hfactory • Hunk • Spring for Hadoop Copyright © William El Kaim 2016 135
136. 136. Developing Applications on Hadoop Example: Cask Copyright © William El Kaim 2016 136
137. 137. Hadoop Security • Apache Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform. • With the advent of Apache YARN, the Hadoop platform can now support a true data lake architecture, and enterprises can run multiple workloads in a multi-tenant environment. • Apache Metron provides a scalable advanced security analytics framework built with the Hadoop community, evolving from the Cisco OpenSOC project. • A cyber security application framework that gives organizations the ability to detect cyber anomalies and respond to them rapidly. • Apache Sentry is a system to enforce fine-grained role-based authorization to data and metadata stored on a Hadoop cluster. Copyright © William El Kaim 2016 137
138. 138. Hadoop Security • Apache Eagle: analyze Big Data platforms for security and performance • Apache Eagle is an open source monitoring platform for the Hadoop ecosystem, which started with monitoring data activities in Hadoop. • It can instantly identify access to sensitive data, recognize attacks/malicious activities and block access in real time. • In conjunction with components such as Ranger, Sentry, Knox, DgSecure and Splunk, Eagle provides a comprehensive solution to secure sensitive data stored in Hadoop. • As of 0.3.0, Eagle stores metadata and statistics in HBase, and supports Druid as a metric store. Copyright © William El Kaim 2016 138
139. 139. Hadoop Governance Data Governance Initiative • Enterprises adopting a modern data architecture with Hadoop must reconcile data management realities when they bring existing and new data from disparate platforms under management. • As customers deploy Hadoop into corporate data and processing environments, metadata and data governance must be vital parts of any enterprise-ready data lake. • Data Governance Initiative (DGI) • with Aetna, Merck, Target, and SAS • Introduces a common approach to Hadoop data governance into the open source community. • Shared framework to shed light on how users access data within Hadoop while interoperating with and extending existing third-party data governance and management tools. • A new project proposed to the Apache Software Foundation: Apache Atlas Copyright © William El Kaim 2016 139
  140. 140. Hadoop Governance Data Governance Initiative Copyright © William El Kaim 2016 140
141. 141. Hadoop Governance Apache Atlas and Apache Falcon • Apache Atlas is a scalable and extensible set of core foundational governance services • It enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem. • Apache Falcon is a framework for managing the data life cycle in Hadoop clusters • It addresses enterprise challenges related to Hadoop data replication, business continuity, and lineage tracing by deploying a framework for data management and processing. • Falcon centrally manages the data lifecycle, facilitates quick data replication for business continuity and disaster recovery, and provides a foundation for audit and compliance by tracking entity lineage and collecting audit logs. Copyright © William El Kaim 2016 141
  142. 142. Hadoop Governance Apache Atlas Capabilities Source: Apache AtlasCopyright © William El Kaim 2016 142
  143. 143. Hadoop Governance Other Vendors Entering The Market • Alation • Cloudera Navigator • Collibra • Informatica Big Data Management • Podium Data • Zaloni Copyright © William El Kaim 2016 143
  144. 144. Hadoop Governance Cloudera Navigator https://www.cloudera.com/products/cloudera-navigator.htmlCopyright © William El Kaim 2016 144
  145. 145. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Market • Hadoop Architecture Examples Copyright © William El Kaim 2016 145
146. 146. Big Data Market • The global big data market will grow from $18.3 billion in 2014 to a whopping $92.2 billion by 2026 (a 14.4% compound annual growth rate). • 2015 was “a breakthrough year for big data”, with the market growing by 23.5 percent, led mainly by Hadoop platform revenues. • Explosive growth of Hortonworks Inc. and other Hadoop vendors, as well as the rapid adoption of Apache Spark and other streaming technologies. • This growth in big data is being fueled by a desire among larger enterprises to become more data-driven, as well as the emergence of new, Web-based, cloud-native startups like AirBnB Inc., Netflix Inc. and Uber Technologies Inc., which were conceived with big data at their core. • 2016 – 2026 Worldwide big data Market Forecast by Wikibon. Copyright © William El Kaim 2016 146
147. 147. Big Data Market • Wikibon breaks down global big data market revenues into three segments: professional services (40% of all revenues in 2015), hardware (31%) and software (29%). • Wikibon’s projection for 2026 shows a markedly different split: rapid growth in big data-related software is set to make that segment overtake the other two and account for 46% of all big data spending in the next ten years, with professional services at 29% and hardware at 25%. • This shift will occur due to the development of better quality software that reduces the need for big data-related services. • 2016 – 2026 Worldwide big data Market Forecast by Wikibon. Copyright © William El Kaim 2016 147
  148. 148. Big Data in Public Cloud Market • Worldwide Big Data revenue in the public cloud • was $1.1B in 2015 and will grow to $21.8B by 2026 • Will grow from 5% of all Big Data revenue in 2015 to 24% of all Big Data spending by 2026. • However, the report highlights ongoing regulatory concerns as well as the structural impediment of moving large amounts of data offsite as inhibitors to mass adoption of Big Data deployments in the public cloud. • “Big Data in the Public Cloud Forecast, 2016-2026” by Wikibon Copyright © William El Kaim 2016 148
  149. 149. Big Data Market: 2014-2016 ($B) Copyright © William El Kaim 2016 149
150. 150. Hadoop Market • According to Wikibon’s latest market analysis • spending on Hadoop software and subscriptions accounted for a mere $187 million in 2014 • less than 1 percent of the $27.4 billion in overall big data spending. • Hadoop spending on software and subscriptions is forecast to grow to $677 million by 2017, when the overall big data market will have grown to $50 billion: still about 1%, and about 3% once professional services are included. • Source • Wikibon’s Big Data Vendor Revenue and Market Forecast 2011-2020 report. Copyright © William El Kaim 2016 150
  151. 151. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Tools landscape • Hadoop Architecture Examples Copyright © William El Kaim 2016 151
  152. 152. Big Data Landscape Hadoop Distributions and Providers • Three Main pure-play Hadoop distributors • Cloudera, Hortonworks, and MapR Technologies • Other Hadoop distributors • SyncFusion: Hadoop for Windows, • Pivotal Big Data Suite • Pachyderm • Hadoop Cloud Provider: • Altiscale, Amazon EMR, BigStep, Google Cloud DataProc, HortonWorks SequenceIQ, IBM BigInsights, Microsoft HDInsight, Oracle Big Data, Qubole, Rackspace • Hadoop Infrastructure as a Service • BlueData, Packet Copyright © William El Kaim 2016 152
  153. 153. Big Data Landscape Hadoop Distributions and Providers Forrester Big Data Hadoop Cloud, Q1 2016Forrester Big Data Hadoop Distributions, Q1 2016 Copyright © William El Kaim 2016 153
  154. 154. Big Data Landscape Hadoop Distributions To Start With • Apache Hadoop • Cloudera Live • Dataiku • Hortonworks Sandbox • IBM BigInsights • MapR Sandbox • Microsoft Azure HDInsight • Syncfusion Hadoop for Windows • W3C Big Data Europe Platform Copyright © William El Kaim 2016 154
  155. 155. Source: Matt Turck Copyright © William El Kaim 2016 155
  156. 156. Big Data Landscape Cloud Provisioning: HortonWorks CloudBreak • CloudBreak is a tool for provisioning Hadoop clusters on public cloud infrastructure and for optimizing the use of cloud resources with elastic scaling • Part of the HortonWorks Data Platform and powered by Apache Ambari, CloudBreak allows enterprises to simplify the provisioning of clusters in the cloud. Source: HortonWorksCopyright © William El Kaim 2016 156
  157. 157. Big Data Landscape Pure-play Hadoop Distributors: SyncFusion https://www.syncfusion.com/products/big-dataCopyright © William El Kaim 2016 157
  158. 158. Big Data Landscape Pure-play Hadoop Distributors: Pivotal Big Data Suite http://pivotal.io/big-data/pivotal-big-data-suiteCopyright © William El Kaim 2016 158
  159. 159. Big Data Landscape Microsoft Azure HDInsight https://azure.microsoft.com/en-us/services/hdinsight/Copyright © William El Kaim 2016 159
160. 160. Big Data Landscape SQL Server 2016 PolyBase • PolyBase is a technology that accesses and combines both non-relational and relational data, all from within SQL Server. • Allows you to run queries on external data in Hadoop or Azure Blob Storage. • The queries are optimized to push computation to Hadoop. • By simply using Transact-SQL (T-SQL) statements, you can import and export data back and forth between relational tables in SQL Server and non-relational data stored in Hadoop or Azure Blob Storage. • You can also query the external data from within a T-SQL query and join it with relational data. https://msdn.microsoft.com/en-us/library/mt163689.aspx Copyright © William El Kaim 2016 160
161. 161. Big Data Landscape Pachyderm: Hadoop Alternative, Container Based • San Francisco-based company founded in 2014 • raised $2 million from Data Collective, Blumberg Capital, Foundation Capital, and others. • The Pachyderm stack uses Docker containers as well as CoreOS and Kubernetes for cluster management. • In Hadoop, people write their jobs in Java, and it all runs on the JVM. • It replaces • MapReduce with Pachyderm Pipelines. • You create a containerized program with the tools of your choice that reads and writes to the local filesystem. • HDFS with the Pachyderm File System • A distributed file system (inspired by Git), providing version control over all the data. • Data is stored in generic object storage such as Amazon’s S3, Google Cloud Storage or the Ceph file system. • Provides historical snapshots of how your data looked at different points in time. http://www.pachyderm.io/ Copyright © William El Kaim 2016 161
  162. 162. Big Data Landscape PNDA: big data analytics platform for networks and services. http://pnda.io/Copyright © William El Kaim 2016 162
163. 163. Big Data Landscape BlueData: Hadoop as a Service, Container Based • BlueData will offer a Big-Data-as-a-Service (BDaaS) software platform • It can deliver any Big Data distribution and application on any infrastructure, whether on-premises or in the public cloud. • Uses Docker containers (secure, embedded, and fully managed) to stay agnostic about the infrastructure, whether physical servers, virtual machines, or now the cloud at scale Source: BlueData Copyright © William El Kaim 2016 163
  164. 164. Big Data Landscape Hadoop Infrastructure as a Service: BlueData Source: BlueDataCopyright © William El Kaim 2016 164
  165. 165. Big Data Landscape Big Data As a Service: Qubole http://www.qubole.com/Copyright © William El Kaim 2016 165
  166. 166. Big Data Landscape Big Data As a Service: BigStep http://bigstep.com/solutions/architectures Real Time Batch Copyright © William El Kaim 2016 166
  167. 167. Big Data Landscape Cloud Provisioning: Apache Ambari + Apache Brooklyn Apache Brooklyn is an application blueprinting and management system which supports a wide range of software and services in the cloud. Source: TheNewStackCopyright © William El Kaim 2016 167
  168. 168. Big Data Landscape Collecting and querying Hadoop Metrics Source: HortonWorksCopyright © William El Kaim 2016 168
  169. 169. Big Data Landscape Big Data application performance monitoring (APM) Source: DrivenCopyright © William El Kaim 2016 169
  170. 170. Big Data Landscape Ingestion Technologies: Apache NiFi Provides scalable directed graphs of data routing, transformation, and system mediation logic https://nifi.apache.org/Copyright © William El Kaim 2016 170
  171. 171. Big Data Landscape Data Wrangling Paxata Trifacta Copyright © William El Kaim 2016 171
  172. 172. Big Data Landscape Data Preparation Source: Bloor Research Copyright © William El Kaim 2016 172
  173. 173. Big Data Landscape Open Source Hadoop RDBMS: Splice Machine http://www.splicemachine.com/Copyright © William El Kaim 2016 173
  174. 174. Big Data Landscape Data integration On Demand: Xplenty https://www.xplenty.com/#featuresCopyright © William El Kaim 2016 174
  175. 175. Big Data Landscape Data integration On Demand: StreamSets Data Collector https://streamsets.com/ Copyright © William El Kaim 2016 175
176. 176. Big Data Landscape Data integration On Demand: SnapLogic http://www.snaplogic.com/ Copyright © William El Kaim 2016 176
  177. 177. Big Data Landscape Enterprise Integration: Informatica Vibe Informatica Vibe allows users to create data-integration mappings once, and then run them across multiple platforms. Copyright © William El Kaim 2016 177
  178. 178. Big Data Landscape Enterprise Integration: Tibco ActiveMatrix BusinessWorks Source: Tibco TIBCO ActiveMatrix BusinessWorks 6 + Apache Hadoop = Big Data Integration Copyright © William El Kaim 2016 178
179. 179. Big Data Landscape IBM DataWorks • Available on Bluemix, IBM’s cloud platform, DataWorks integrates and leverages Apache Spark, IBM Watson Analytics, and the IBM Data Science Experience. • It is designed to help organizations: • Automate the deployment of data assets and products using cognitive-based machine learning and Apache Spark; • Ingest data faster than any other data platform, from 50 to hundreds of Gbps, and from all endpoints: enterprise databases, Internet of Things, weather, and social media; • Leverage an open ecosystem of more than 20 partners and technologies, such as Confluent, Continuum Analytics, Galvanize, Alation, NumFOCUS, RStudio, Skymind, and more. • Additionally, DataWorks is underpinned by core cognitive capabilities, such as cognitive-based machine learning. This helps speed up the process from data discovery to model deployment, and helps users uncover new insights that were previously hidden to them. Source: IBM Dataworks Copyright © William El Kaim 2016 179
  180. 180. Big Data Landscape IBM DataWorks Source: IBM DataworksCopyright © William El Kaim 2016 180
  181. 181. Big Data Landscape Data Science: Anaconda Platform https://www.continuum.io/Copyright © William El Kaim 2016 181
  182. 182. Big Data Landscape Data Science: Dataiku DSS http://www.dataiku.com/dss/ Create Machine Learning Models Combine and Join Datasets Copyright © William El Kaim 2016 182
183. 183. Big Data Landscape Data Science: Datameer http://www.datameer.com/ Copyright © William El Kaim 2016 183
184. 184. Big Data Landscape Data Science: IBM Data Science Experience • Data Science Experience is a cloud-based development environment for near real-time, high performance analytics • Available on the IBM Cloud Bluemix platform • Provides • 250 curated data sets • open source tools such as H2O, RStudio and Jupyter Notebooks on Apache Spark, in a collaborative workspace • in a single security-rich managed environment. • Helps data scientists uncover and share meaningful insights with developers, making it easier to rapidly develop applications that are infused with intelligence. http://datascience.ibm.com/ Copyright © William El Kaim 2016 184
  185. 185. Big Data Landscape Data Science: Tamr http://www.tamr.com/ Copyright © William El Kaim 2016 185
186. 186. Big Data Landscape Machine Learning as a Service • Open Source • Accord (.NET), Apache Mahout, Apache Samoa, Apache Spark MLlib and MLbase, Apache SystemML, Cloudera Oryx, GoLearn (Go), H2O, Photon ML, PredictionIO, RHadoop, Scikit-learn (Python), Seldon, Shogun (C++), Google TensorFlow, Weka. • Available as a Service • Algorithmia, Algorithms.io, Amazon ML, BigML, DataRobot, FICO, Google Prediction API, HPE Haven OnDemand, IBM Watson Analytics, Microsoft Machine Learning Studio, PurePredictive, Predicsis, Yottamine. • Examples • BVA with Microsoft Azure ML • Quick Review of Amazon Machine Learning • BigML training Series • Handling Large Data Sets with Weka: A Look at Hadoop and Predictive Models Copyright © William El Kaim 2016 186
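To give a feel for the open source end of this list, a minimal scikit-learn sketch (using a toy dataset bundled with the library purely to keep the example self-contained):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # classic demo dataset shipped with sklearn
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))

The hosted services on the list wrap essentially this train/predict loop behind REST APIs and managed infrastructure.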
187. 187. Scalable Data Science with R • Hadoop: analyze data with Hadoop through R code (RHadoop) • rhdfs to interact with HDFS systems; • rhbase to connect with HBase; • plyrmr to perform common data transformation operations over large datasets; • rmr2 that provides a map-reduce API; • and ravro that writes and reads Avro files. • Spark: with SparkR • It is possible to use Spark’s distributed computation engine to enable large-scale data analysis from the R shell. It provides a distributed data frame implementation that supports operations like selection, filtering, aggregation, etc., on large data sets. • Programming with Big Data in R • The "Programming with Big Data in R" project (pbdR) is based on MPI and can be used on high-performance computing (HPC) systems, providing a true parallel programming environment in R. Source: Federico Castanedo Copyright © William El Kaim 2016 187
188. 188. Scalable Data Science with R • After the data preparation step, the next common data science phase consists of training machine learning models, which can be performed on a single machine or distributed among different machines. • For distributed machine learning frameworks, the most popular approaches using R are the following: • Spark MLlib: through SparkR, some of the machine learning functionalities of Spark are exported in the R package. • H2O framework: a Java-based framework that allows building scalable machine learning models in R or Python. • Apache MADlib (incubating): Big Data Machine Learning in SQL Copyright © William El Kaim 2016 188
  189. 189. Big Data Landscape Business Intelligence and Analytics Platforms Copyright © William El Kaim 2016 189
190. 190. Big Data Landscape Business Intelligence and Analytics Platforms • Tableau, Qlikview and Jethro (a SQL acceleration engine for BI on Big Data, compatible with BI tools like Tableau and Qlik). • Alteryx, Birst, Datawatch, Domo, GoodData, Looker, PyramidAnalytics, Saagie and ZoomData are increasingly encroaching on the territory once claimed by Qlik and Tableau. • At the same time, a new crop of BI tools based on Hadoop and Spark data, from the likes of Platfora, Datameer, and ClearStory Data, has appeared on the market. • And the old guard is still there: SAP Lumira, Microsoft Power BI, SAS Visual Analytics • And open source tools like Datawrapper Copyright © William El Kaim 2016 190
191. 191. Big Data Landscape Business Intelligence and Analytics Platforms: Saagie https://www.saagie.com/products Copyright © William El Kaim 2016 191
192. 192. Big Data Landscape Hadoop for Data Analytics and Use: Apache Zeppelin http://zeppelin.incubator.apache.org Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more. Copyright © William El Kaim 2016 192
193. 193. Big Data Landscape Dynamic Data Warehouse http://www.infoworks.io/ Copyright © William El Kaim 2016 193
  194. 194. Big Data Landscape Data Visualization Software Visual analytics is the act of finding meaning in data using visual artifacts such as charts, graphs, maps and dashboards. In addition, the user interface is typically driven by drag and drop actions using wholly visual constructs. Copyright © William El Kaim 2016 194
195. 195. Big Data Landscape Data Visualization Software • Four dominant modes of analysis: descriptive (traditional BI), discovery (looking for unknown facts), predictive (finding consistent patterns that can be used in future activities), and prescriptive (actions that can be taken to improve performance). • BeyondCore, BIME, ClearStory, DOMO, GoodData, Inetsoft, InfoCaptor, Logi Analytics, Looker, Microsoft Power BI, Microstrategy, Prognoz, Qlik Sense, SAP Lumira, SAS Visual Analytics, Sisense, Spotfire, Tableau, ThoughtSpot, Yellowfin. Source: ButlerAnalytics Copyright © William El Kaim 2016 195
196. 196. Big Data Landscape Dataviz Tools • For Non Developers • ChartBlocks, Infogram, Plotly, Raw, Visual.ly • For Developers • D3.js, Infovis, Leaflet, NVD3, Processing.js, Recline.js, visualize.js • Chart.js, Chartist.js, Ember Charts, Google Charts, FusionCharts, Highcharts, n3-charts, Sigma JS, Polymaps • More • Datavisualization.ch curated list • ProfitBricks list • Dedicated libraries are also available for Python, Java, C#, Scala, etc. Copyright © William El Kaim 2016 196
197. 197. Big Data Landscape Other Interesting Tools • Storage • Druid is an open-source analytics data store designed for OLAP queries on time-series data (trillions of events, petabytes of data). • OpenTSDB (HBase) and Kairos: time-series databases built on top of open-source NoSQL data stores. • Aerospike, VoltDB: database software for handling large amounts of real-time event data. • Services • SyncSort Hadoop ETL Solution extends the capabilities of Hadoop. • Snowplow is an event analytics platform. • IT monitoring: Graphistry, Splunk, SumoLogic, ScalingData, and CloudPhysics. • Modern monitoring platforms using streaming analytics: Anodot, Graphite, DR-Elephant, and SignalFx Copyright © William El Kaim 2016 197
  198. 198. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Big Data Ecosystem For Science • Hadoop Architecture Examples Copyright © William El Kaim 2016 198
199. 199. Big Data Ecosystem For Science • Large-scale data management is essential for experimental science and has been for many years. Telescopes, particle accelerators and detectors, and gene sequencers, for example, generate hundreds of petabytes of data that must be processed to extract secrets and patterns in life and in the universe. • The data technologies used in these various science communities often predate those in the rapidly growing industry big data world and, in many cases, continue to develop independently, occupying a parallel big data ecosystem for science, supported by the National Energy Research Scientific Computing Center (NERSC). • Across these projects we see a common theme: data volumes are growing, and there is an increasing need for tools that can effectively store and process data at such a scale. • In some cases, the projects could benefit from big data technologies being developed in industry, and in other projects, the research itself will lead to new capabilities. Copyright © William El Kaim 2016 199 Source: Wahid Bhimji on O’Reilly
  200. 200. Big Data Ecosystem For Science Copyright © William El Kaim 2016 200Source: Wahid Bhimji on O’Reilly
201. 201. Big Data Ecosystem For Science Other Interesting Services • Data Format • ROOT offers a self-describing binary file format with huge flexibility for serialization of complex objects and column-wise data access. • The HDF5 format enables more efficient processing of simulation output thanks to its parallel input/output (I/O) capabilities • Data Federation • The XrootD data access protocol allows all data to be accessed in a single global namespace and served up in a mechanism that is both fault-tolerant and high-performance. • Data Management • Big PanDA runs analyses that allow thousands of collaborators to run hundreds of thousands of processing steps on exabytes of data, as well as monitor and catalog that activity. Copyright © William El Kaim 2016 201
  202. 202. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 202
203. 203. Hadoop Architecture • Data sources: open data, operational systems (ODS, IoT), existing sources of data (databases, DW, data marts). • Data ingestion: batch and streaming paths into the data lake. • Data lake: metadata, source data and computed data, built on Hadoop, Spark and SQL & NoSQL databases. • Lakeshore: feeds data preparation and feature preparation. • Data science tools & platforms: Dataiku, Tableau, Python, R, etc. • BI tools & platforms: Qlik, Tibco, IBM, SAP, BIME, etc. • App. services: Cascading, Crunch, Hfactory, Hunk, Spring for Hadoop. • All serving data-driven business processes, applications and services. Copyright © William El Kaim 2016 203
204. 204. Hadoop Technologies • Data sources: open data, operational systems (ODS, IoT), existing sources of data (databases, DW, data marts). • Ingestion technologies: Apex, Flink, Flume, Kafka, Amazon Kinesis, NiFi, Samza, Spark, Sqoop, Scribe, Storm, NFS Gateway, etc., covering batch, Map Reduce, and event stream & micro batch styles. • Distributed file systems: GlusterFS, HDFS, Amazon S3, MapR-FS, ElasticSearch. • Encoding formats: JSON, RCFile, Parquet, ORCFile. • NoSQL databases: Cassandra, Ceph, DynamoDB, HBase, Hive, Impala, Ring, OpenStack Swift, etc. • Distributions: Cloudera, HortonWorks, MapR, SyncFusion, Amazon EMR, Azure HDInsight, Altiscale, Pachyderm, Qubole, etc. • Data warehouse, lakeshore & analytics: Cassandra, Druid, DynamoDB, MongoDB, Redshift, Google BigQuery, etc. • Data science: Dataiku, Datameer, Tamr, R, SAS, Python, RapidMiner, etc. • Machine learning: BigML, Mahout, Predicsys, Azure ML, TensorFlow, H2O, etc. • BI tools & platforms: Qlik, Tableau, Tibco, Jethro, Looker, IBM, SAP, BIME, etc. • App. services and analytics apps: Cascading, Crunch, Hfactory, Hunk, Spring for Hadoop, D3.js, Leaflet. Copyright © William El Kaim 2016 204
  205. 205. Intuit Example: Initial Cloud Platform Copyright © William El Kaim 2016 205
  206. 206. Intuit Example: Initial Platform Concerns Key data sources 1. Clickstream 2. Transactional user-entered data 3. Back office data and insights Key cross-cutting concerns 4. Traceability – customer ID, transaction ID 5. REACTive platform architecture 6. Analytics infrastructure 7. Model congruity 8. Sources of truth Copyright © William El Kaim 2016 206
  207. 207. Intuit Example: Revised Platform Copyright © William El Kaim 2016 207
