Big Data Technologies You Didn’t Know About
About Us
• Emerging technology firm focused on helping enterprises build breakthrough software solutions
• Building software solutions powered by disruptive enterprise software trends:
  - Machine learning and data science
  - Cybersecurity
  - Enterprise IoT
  - Powered by cloud and mobile
• Bringing innovation from startups and academic institutions to the enterprise
• Award-winning agency: Inc. 500, American Business Awards, International Business Awards
Agenda
• Big data technologies you didn’t know about:
• Apache Flink
• Apache Samza
• Google Cloud Dataflow
• StreamSets
• TensorFlow
• Apache NiFi
• Druid
• LinkedIn WhereHows
• Microsoft Cognitive Services
Two Goals…
Think Beyond Traditional Big Data Stacks
Learn from Companies Building Big Data Pipelines at Scale
Big Data pipelines in the enterprise
Areas of a Big Data Pipeline
• Stream data ingestion
• Data processing
• Data transformations
• Machine learning
• Cognitive computing
• High-performance data access
Data Processing
Technology Stacks You Know
But You Probably Didn’t Know About….
Apache Flink
• Apache Flink, like Apache Hadoop and Apache Spark, is a community-driven open source framework for distributed Big Data analytics.
• The Flink engine exploits data streaming, in-memory processing, and iterative operators to improve performance.
• Apache Flink has its origins in Stratosphere, a research project conceived in 2008 by Professor Volker Markl of the Technische Universität Berlin in Germany.
• In German, "flink" means agile or swift. Flink joined the Apache Incubator in April 2014 and graduated to an Apache Top-Level Project (TLP) in December 2014.
Apache Flink
• Draws on concepts from MPP database technology:
  - Declarativity
  - Query optimization
  - Efficient parallel in-memory and out-of-core algorithms
• Draws on concepts from Hadoop MapReduce technology:
  - Massive scale-out
  - User-defined functions
  - Complex data types
  - Schema on read
• Adds:
  - Streaming
  - Iterations
  - Advanced dataflows
  - General APIs
Apache Flink: An Example
case class Word(word: String, frequency: Int)

DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word, 1))}
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word, 1))}
  .groupBy("word").sum("frequency")
  .print()
Stream Data Processing
Technology Stacks You Know
But You Probably Didn’t Know About….
Apache Samza
• Created by LinkedIn to extend the capabilities of Apache Kafka
• Simple API
• Managed state
• Fault Tolerant
• Durable messaging
• Scalable
• Extensible
• Processor Isolation
Apache Samza: Overview
• Samza code runs as a YARN job.
• You implement the StreamTask interface, which defines a process() call.
• A StreamTask runs inside a task instance, which itself runs inside a YARN container.
Apache Samza: Operators
• Filter records matching condition
• Map record ⇒ func(record)
• Join two/more datasets by key
• Group records with the same value in field
• Aggregate records within the same group
• Pipe job 1’s output ⇒ job 2’s input
• MapReduce assumes a fixed dataset. Can we adapt this model to unbounded streams?
Apache Samza: Sample Code
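The original slide's code isn't reproduced here. As a rough illustration of the process()-per-message contract described above, here is a plain-Python sketch; note that Samza's real StreamTask API is Java, and everything besides the StreamTask/process names is invented for illustration:

```python
# Conceptual sketch only: Samza's real StreamTask API is Java. This
# plain-Python imitation just illustrates the process()-per-message model.

class FilterUppercaseTask:
    """Imitates a StreamTask: called once per incoming message envelope."""

    def process(self, envelope, collector):
        message = envelope["message"]
        if "error" in message:                 # filter operator
            collector.append(message.upper())  # map operator

def run(task, envelopes):
    collector = []                 # stands in for the output stream
    for envelope in envelopes:     # in Samza, the container drives this loop
        task.process(envelope, collector)
    return collector

out = run(FilterUppercaseTask(),
          [{"message": "error: disk full"}, {"message": "ok"}])
print(out)  # ['ERROR: DISK FULL']
```

In real Samza the container also checkpoints offsets and manages task state, which this sketch omits.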
Data Transformation
Technology Stacks You Know
But You Probably Didn’t Know About….
Google Cloud Dataflow
• Native Google Cloud data processing service
• Simple programming model for batch and streaming data processing tasks
• Provides a managed service to control the execution of data processing jobs
• Jobs can be authored using the Dataflow SDKs (Apache Beam)
Google Cloud Dataflow: Details
• A pipeline encapsulates an entire series of computations: it accepts input data from external sources, transforms that data, and produces output data.
• A PCollection abstracts a unit of data in a pipeline.
• Sources and Sinks abstract read and write operations in a pipeline.
• Cloud Dataflow provides management, monitoring, and security capabilities for data pipelines.
Google Cloud Dataflow is Based on Apache Beam
• 1. Portable: the same code runs on different runners and backends, on premises, in the cloud, or locally.
• 2. Unified: a single model covers both batch and stream processing.
• 3. Advanced features: event windowing, triggering, watermarks, late-data handling, etc.
• 4. Extensible model and SDK: an extensible API lets you define custom sources that read and write in parallel.
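To make the pipeline/PCollection model concrete, here is a toy stand-in in plain Python. This is not the Apache Beam SDK; the class and methods below are invented to show how a pipeline chains transforms from a source to a sink:

```python
# Toy illustration of the pipeline/PCollection idea (NOT the Beam SDK):
# a PCollection flows through a chain of transforms.

class PCollection:
    """A bounded collection of elements flowing through a pipeline."""

    def __init__(self, elements):
        self.elements = list(elements)

    def map(self, fn):
        return PCollection(fn(e) for e in self.elements)

    def filter(self, pred):
        return PCollection(e for e in self.elements if pred(e))

# Source: read input; transforms: map/filter; sink: emit output.
source = PCollection(["3", "14", "15"])
result = (source
          .map(int)                   # parse strings into numbers
          .filter(lambda n: n > 5)    # keep only large values
          .map(lambda n: n * 2))      # transform the survivors
print(result.elements)  # [28, 30]
```

In the real Beam model the same chain of transforms can run unchanged against a bounded batch source or an unbounded streaming source, which is the "unified" point above.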
But You Probably Didn’t Know About….
StreamSets Data Collector
• Data processing platform optimized for data in motion
• Visual data flow authoring model
• Open source distribution model
• On-premises and cloud distributions
• Rich monitoring and management interfaces
StreamSets Data Collector: Details
• Data Collector streams and processes data in real time using data pipelines.
• A pipeline describes a data flow from origin to destination.
• A pipeline is composed of origins, destinations, and processors.
• Extensibility model based on JavaScript and Jython.
• The lifecycle of a Data Collector can be controlled via the administration console.
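To illustrate the Jython-based extensibility point, here is the general shape of a scripted processor. In the real evaluator stage the platform supplies the incoming records and the output writer; both are stubbed below so the snippet is self-contained, and the field names are invented:

```python
# Sketch of a scripted-processor body (Jython is Python-syntax, so this
# runs as plain Python). `records` and `output` are stubs standing in for
# objects the platform would supply; field names are invented.

class Output:
    def __init__(self):
        self.written = []

    def write(self, record):
        self.written.append(record)

records = [{"value": {"name": "ada"}}, {"value": {"name": "lin"}}]
output = Output()

# The script body: enrich each record, then pass it downstream.
for record in records:
    record["value"]["name_upper"] = record["value"]["name"].upper()
    output.write(record)

print([r["value"]["name_upper"] for r in output.written])  # ['ADA', 'LIN']
```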
Machine Learning
Technology Stacks You Know
But You Probably Didn’t Know About….
TensorFlow
• Second-generation machine learning system at Google, the successor to DistBelief
• TensorFlow grew out of a project at Google, called Google Brain, aimed at
applying various kinds of neural network machine learning to products and
services across the company.
• An open source software library for numerical computation using data flow graphs
• Used in many projects at Google, including:
  1. DeepDream
  2. RankBrain
  3. Smart Reply
TensorFlow: Details
• Data flow graphs describe mathematical computation
with a directed graph of nodes & edges
• Nodes in the graph represent mathematical
operations.
• Edges represent the multidimensional data arrays
(tensors) communicated between them.
• Edges describe the input/output relationships between
nodes.
• The flow of tensors through the graph is where
TensorFlow gets its name.
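The node/edge model above can be sketched in a few lines of plain Python. This is not the TensorFlow API; it is a minimal dataflow graph where nodes are operations and edges carry the values ("tensors") between them:

```python
# Minimal dataflow-graph sketch (not the TensorFlow API): nodes are
# operations, edges point at the producer nodes whose outputs they consume.

class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs   # inputs are the incoming edges

    def eval(self):
        # Evaluate producers first, then apply this node's operation:
        # values "flow" along the edges toward the output node.
        return self.op(*(n.eval() for n in self.inputs))

const = lambda v: Node(lambda: v)            # source node: no inputs
add = lambda a, b: Node(lambda x, y: x + y, a, b)
mul = lambda a, b: Node(lambda x, y: x * y, a, b)

# Graph for (2 + 3) * 4.
graph = mul(add(const(2), const(3)), const(4))
print(graph.eval())  # 20
```

Real TensorFlow adds multidimensional array values, automatic differentiation, and device placement on top of this same graph structure.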
TensorFlow
• Tensor
• Variable
• Operation
• Session
• Placeholder
• TensorBoard
Fast Data Access
Technology Stacks You Know
But You Probably Didn’t Know About….
Druid
• Druid was started in 2011 to:
  - Power interactive data applications
  - Support multi-tenancy: lots of concurrent users
  - Scale: trillions of events/day with sub-second queries
  - Enable real-time analysis
• Key features:
  - Low-latency ingestion
  - Fast aggregations
  - Arbitrary slice-and-dice capabilities
  - High availability
  - Approximate and exact calculations
Druid: Details
• Realtime Node
• Historical Node
• Broker Node
• Coordinator Node
• Indexing Service
• JSON-based query language
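As a sense of what the JSON query language looks like, here is the general shape of a native "timeseries" query built as a Python dict and serialized to JSON; the datasource, metric names, and interval are placeholders, not from the original deck:

```python
import json

# Shape of a Druid native "timeseries" query (datasource, metric names,
# and interval are placeholders) serialized to the JSON a broker accepts.
query = {
    "queryType": "timeseries",
    "dataSource": "pageviews",
    "granularity": "hour",
    "intervals": ["2016-01-01/2016-01-02"],
    "aggregations": [
        {"type": "longSum", "name": "views", "fieldName": "count"}
    ],
}
print(json.dumps(query, indent=2))
```

The query is POSTed to a Broker node, which fans it out to Realtime and Historical nodes and merges the partial results.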
Low Latency Data Flows
Technology Stacks You Know
But You Probably Didn’t Know About….
Apache NiFi
• Powerful and reliable system to process and distribute data
• Directed graphs of data routing and transformation
• Web-based user interface for creating, monitoring, and controlling data flows
• Highly configurable: modify data flows at runtime, dynamically prioritize data
• Data provenance tracks data through the entire system
• Easily extensible through development of custom components
Apache NiFi: Architecture
Apache NiFi: Concepts
• FlowFile
  - Unit of data moving through the system
  - Content + attributes (key/value pairs)
• Processor
  - Performs the work; can access FlowFiles
• Connection
  - Links between processors
  - Queues that can be dynamically prioritized
• Process Group
  - Set of processors and their connections
  - Receives data via input ports, sends data via output ports
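The concepts above can be sketched in plain Python (this is not NiFi's actual API): a FlowFile is content plus attributes, a processor transforms FlowFiles, and a connection queues them between processors. The processor below is invented for illustration:

```python
from collections import deque

# Plain-Python sketch of the NiFi concepts (not NiFi's real API).

class FlowFile:
    """Unit of data: content bytes plus key/value attributes."""

    def __init__(self, content, attributes=None):
        self.content = content
        self.attributes = dict(attributes or {})

def route_on_size(flowfile):
    """A made-up processor: tags each FlowFile with a size attribute."""
    flowfile.attributes["size"] = (
        "large" if len(flowfile.content) > 5 else "small")
    return flowfile

# A connection is a queue between processors; it could be reprioritized.
connection = deque([FlowFile(b"hi"), FlowFile(b"hello world")])
processed = [route_on_size(connection.popleft())
             for _ in range(len(connection))]
print([f.attributes["size"] for f in processed])  # ['small', 'large']
```

Data provenance, in this picture, amounts to recording every attribute change and queue hop each FlowFile goes through.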
Data Discovery
Technology Stacks You Know
But You Probably Didn’t Know About….
WhereHows
LinkedIn WhereHows
• Where is my data? How did it get there?
• Enterprise data catalog
• Metadata search
• Collaboration
• Data lineage analysis
• Connectivity to many data sources and ETL tools
• Powers LinkedIn's data discovery layer
LinkedIn WhereHows: Architecture
• Web interface for data discovery
• API-enabled
• Backend server that controls metadata crawling and integration with other systems
LinkedIn WhereHows: Data Lineage
• Collects metadata from ETL platforms and scripts
• Sources include:
  - Pig
  - MapReduce
  - Informatica
  - Teradata
• Visualizes the lineage information associated with a data source
Cognitive Computing
Technology Stacks You Know
But You Probably Didn’t Know About….
Microsoft Cognitive Services
• Based on Project Oxford and Bing
• Offers 22 cognitive computing APIs
• Main categories include:
• Vision
• Speech
• Language
• Knowledge
• Search
• Integrated with Cortana Intelligence Suite
Microsoft Cognitive Services
Microsoft Cognitive Services: Developer Experience
• 22 different REST APIs that abstract cognitive capabilities
• SDKs for Windows, iOS, Android, and Python
• Open source
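Since these are REST APIs, calling one is just an authenticated HTTP request. The sketch below builds such a request with the standard library; the endpoint URL, image URL, and JSON body are placeholders rather than the documented service contract (the subscription-key header is the usual Azure API-key convention):

```python
import json
import urllib.request

# Illustrative shape of a REST call to a cognitive API. The endpoint and
# body are placeholders, not the documented Microsoft service contract.

def build_request(endpoint, api_key, image_url):
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Conventional Azure API-key header:
            "Ocp-Apim-Subscription-Key": api_key,
        },
        method="POST",
    )

req = build_request("https://example.invalid/vision/analyze",
                    "MY-KEY", "https://example.invalid/cat.jpg")
print(req.get_method(), req.full_url)
```

Sending it with `urllib.request.urlopen(req)` would return the service's JSON analysis of the image.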
Summary
• The big data ecosystem is constantly evolving
• There are a lot of relevant new technologies beyond the traditional Hadoop-Spark stacks
• Big internet companies are leading innovation in the space
Thanks
http://Tellago.com
Info@Tellago.com
