1 
Big Data Hoopla Simplified – 
Hadoop, MapReduce, NoSQL … 
TDWI Conference – Memphis, TN 
Oct 29, 2014 
© Talend 2014
2 
About the Presenter 
Rajan Kanitkar 
• Senior Solutions Engineer 
• Rajan Kanitkar is a Pre-Sales Consultant with Talend. He
has been active in the broader Data Integration space for
the past 15 years, with experience at several leading
software companies. His specialties at Talend include
Data Integration (DI), Big Data (BD), Data Quality (DQ)
and Master Data Management (MDM).
• Contact: rkanitkar@talend.com 
© Talend 2014
3 
Big Data Ecosystem 
© Talend 2014
4 
Quick Reference – Big Data 
Hadoop: Apache Hadoop is an open-source software framework for storage and large-scale
processing of data sets on clusters of commodity hardware.
Hadoop v1.0 – the original version, focused on HDFS and MapReduce. Resource
management and job tracking were handled by a single entity.
Hadoop v2.0 – sometimes called MapReduce 2 (MRv2), also called YARN. Splits resource
management and job monitoring into two separate daemons. This new
architecture allows processing engines other than MapReduce to be managed and
monitored on the same cluster.
© Talend 2014
5 
Quick Reference – Big Data
• Hadoop: the core project 
• HDFS: the Hadoop Distributed File System 
• MapReduce: the software framework for distributed 
processing of large data sets 
• Hive: a data warehouse infrastructure that provides data 
summarization and a querying language 
• Pig: a high-level data-flow language and execution 
framework for parallel computation 
• HBase: the Hadoop database. Use it when you
need random, real-time read/write access to your Big
Data
• And many, many more: Sqoop, HCatalog, Zookeeper,
Oozie, Cassandra, MongoDB, etc.
© Talend 2014
6 
Hadoop Core – HDFS 
[Diagram: the Client performs metadata operations against the Name Node and
reads/writes blocks directly on the Data Nodes; the Name Node controls the Data
Nodes, and each block is replicated across several Data Nodes.]
© Talend 2014
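The placement scheme in the diagram can be sketched as a toy, single-process simulation. Everything here is hypothetical and simplified for illustration — the block size, node names and round-robin placement are invented; real HDFS uses 128 MB blocks and rack-aware replica placement:

```python
# Toy sketch of HDFS-style block placement (hypothetical, simplified):
# a file is split into fixed-size blocks and each block is replicated
# onto several data nodes; the "name node" keeps only the metadata map.

BLOCK_SIZE = 4          # bytes per block (real HDFS default: 128 MB)
REPLICATION = 3         # copies of each block (the HDFS default)
DATA_NODES = ["dn1", "dn2", "dn3", "dn4"]

def place_blocks(data: bytes):
    """Return a name-node-style map: block id -> (content, replica nodes)."""
    namespace = {}
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        # round-robin placement; real HDFS is rack-aware
        replicas = [DATA_NODES[(i // BLOCK_SIZE + r) % len(DATA_NODES)]
                    for r in range(REPLICATION)]
        namespace[i // BLOCK_SIZE] = (block, replicas)
    return namespace

meta = place_blocks(b"hello big data!")
# 15 bytes in 4-byte blocks -> 4 blocks, each on 3 of the 4 data nodes
```

Losing any one data node leaves every block readable from two other replicas, which is the point of the "Replicate" arrow in the diagram.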
7 
Hadoop Core – MapReduce 
© Talend 2014 
The "Word Count" Example
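The word-count example can be imitated in a few lines of plain Python — a sketch of the map, shuffle and reduce phases, not real MapReduce code (the input lines are invented):

```python
# A minimal, single-process imitation of the MapReduce "word count"
# example: map emits (word, 1) pairs, shuffle groups pairs by key,
# reduce sums each group.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["Deer Bear River", "Car Car River"])))
# -> {'deer': 1, 'bear': 1, 'river': 2, 'car': 2}
```

In real MapReduce the map tasks run on the nodes holding the input blocks, and the shuffle moves each key's pairs to the reducer responsible for it.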
8 
Quick Reference – Data Services 
HCatalog: a set of interfaces that opens up access to Hive's metastore for tools inside and outside of the
Hadoop grid. Donated by Hortonworks to Apache and merged into the Hive project in March 2013. Enables
users with different processing tools – Pig, MapReduce, and Hive – to more easily read and write data on
the cluster.
HBase: a non-relational, distributed database modeled after Google's Bigtable. Good at storing sparse
data. Considered a key-value columnar database. Runs on top of HDFS. Useful for random, real-time
read/write access.
Hive: a data warehouse infrastructure built on top of Hadoop. Provides data summarization, ad-hoc
query, and analysis of large datasets. Lets you query data using a SQL-like language called HiveQL
(HQL).
Mahout: a library of scalable machine-learning algorithms implemented on top of Hadoop. Mahout
supports collaborative filtering, clustering, classification and frequent item set mining.
Pig: lets you write complex MapReduce transformations in the Pig Latin scripting language. Pig
Latin defines a set of transformations such as aggregate, join and sort. Pig translates a Pig Latin script
into MapReduce so that it can be executed within Hadoop.
Sqoop: a utility for bulk data import/export between HDFS and structured data stores such as relational
databases.
© Talend 2014
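As a rough illustration of the kind of dataflow Pig Latin expresses (GROUP, SUM, ORDER), here is a hypothetical single-process analogue in Python — the sales records and field names are invented, and Pig would of course compile the equivalent script to MapReduce jobs:

```python
# Rough single-process analogue of a Pig Latin script such as:
#   grp    = GROUP sales BY region;
#   totals = FOREACH grp GENERATE group, SUM(sales.amount);
#   sorted = ORDER totals BY $1 DESC;
from itertools import groupby
from operator import itemgetter

sales = [("east", 100), ("west", 250), ("east", 50), ("north", 75)]

# GROUP BY region (groupby needs its input pre-sorted on the key)
grouped = groupby(sorted(sales, key=itemgetter(0)), key=itemgetter(0))
# FOREACH ... GENERATE SUM(amount)
totals = [(region, sum(amount for _, amount in rows)) for region, rows in grouped]
# ORDER BY total DESC
totals.sort(key=itemgetter(1), reverse=True)
# -> [('west', 250), ('east', 150), ('north', 75)]
```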
9 
Quick Reference – Operational Services 
Oozie: Apache workflow scheduler for Hadoop. It allows for coordination between Hadoop jobs. A workflow in
Oozie is defined as a Directed Acyclic Graph (DAG).
Zookeeper: a distributed, highly available coordination service. Allows distributed processes to coordinate with
each other through a shared hierarchical namespace of data registers (called znodes). Writing distributed
applications is hard, primarily because of partial failure; ZooKeeper gives you a set of tools to build
distributed applications that can safely handle partial failures.
Kerberos: a computer network authentication protocol which provides mutual authentication. The name comes
from the three-headed dog of Greek mythology, and Kerberos likewise has three parties: 1) the Key Distribution
Center (KDC), 2) the client user, and 3) the server with the desired service to access. The KDC performs two
service functions: Authentication (are you who you say you are?) and Ticket-Granting (issuing an expiring ticket
that grants access to certain resources). A Kerberos principal is a unique identity to which Kerberos can assign
tickets (like a username). A keytab is a file containing pairs of Kerberos principals and encrypted keys (derived
from the Kerberos password).
© Talend 2014
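ZooKeeper's hierarchical namespace of znodes can be sketched as a toy in-memory tree. The class, method names and paths below are invented for illustration; real ZooKeeper adds replication, versions, watches, and ephemeral/sequential nodes:

```python
# Toy in-memory model of ZooKeeper's hierarchical namespace: each path
# is a "znode" holding a small data register, and children hang off a
# parent that must already exist (a sketch only, not the real API).

class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}          # path -> data register

    def create(self, path: str, data: bytes = b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:     # ZooKeeper also rejects orphans
            raise KeyError(f"parent {parent} does not exist")
        self.nodes[path] = data

    def get(self, path: str) -> bytes:
        return self.nodes[path]

    def children(self, prefix: str):
        depth = prefix.rstrip("/").count("/") + 1
        return sorted(p for p in self.nodes
                      if p != "/" and p.startswith(prefix.rstrip("/") + "/")
                      and p.count("/") == depth)

zk = ZNodeTree()
zk.create("/locks")
zk.create("/locks/job-1", b"worker-a")   # e.g. a coordination record
```

Patterns like distributed locks and leader election are built by having processes create and watch znodes under a shared parent like `/locks`.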
10 
MapReduce 2.0, YARN, Storm, Spark 
• YARN: ensures predictable performance & QoS for all apps 
• Enables apps to run “IN” Hadoop rather than “ON” 
• Streaming with Apache Storm 
• Mini-Batch and In-Memory with Apache Spark 
© Talend 2014 
Applications run natively IN Hadoop, on YARN (cluster resource management)
over HDFS2 (redundant, reliable storage):
BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, Spark),
GRAPH (Giraph), NoSQL (MongoDB), EVENTS (Falcon), ONLINE (HBase),
OTHER (Search)
Source: Hortonworks
11 
Quick Reference – Hadoop 2.0 Additions 
Storm: a distributed real-time computation system. A Storm cluster is similar to a Hadoop cluster, but
where Hadoop runs "MapReduce jobs", Storm runs "topologies". Jobs and topologies are very
different: a MapReduce job eventually finishes, while a topology processes messages forever
(or until you kill it). Storm can run on top of YARN.
Spark: a parallel computing framework which can operate over any Hadoop input source: HDFS,
HBase, Amazon S3, Avro, etc. Holds intermediate results in memory, rather than writing them to
disk; this drastically reduces query return time. Like a Hadoop cluster, but supports more than just
MapReduce.
Tez: a framework which allows a complex directed acyclic graph of tasks for processing data,
built atop Apache Hadoop YARN. MapReduce is batch-oriented and unsuited for interactive query;
Tez allows Hive and Pig to process interactive queries at petabyte scale, with support for
machine learning.
© Talend 2014
12 
Apache Spark 
What is Spark?
• Spark is an in-memory cluster computing engine that includes an HDFS-compatible
in-memory file system.
Hadoop MapReduce
• Batch processing at scale
• Storage: Hadoop HDFS
• Runs on Hadoop
vs.
Spark
• Batch, interactive, graph and real-time processing
• Storage: Hadoop HDFS, Amazon S3, Cassandra…
• Runs on many platforms
• Fast in-memory processing, up to 100x faster than MapReduce (M/R)
© Talend 2014
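Why holding intermediate results in memory speeds up repeated queries can be sketched with a toy, RDD-like wrapper. This is not the Spark API — the class below is invented; in real Spark, `cache()`/`persist()` on an RDD or DataFrame play the analogous role:

```python
# Toy analogue of Spark's in-memory reuse: a deferred dataset is
# materialized once, then several downstream queries read the cached
# result instead of recomputing (or re-reading from disk, as a chain
# of MapReduce jobs would).

class CachedDataset:
    def __init__(self, compute):
        self._compute = compute      # deferred, like an RDD's lineage
        self._cache = None
        self.materializations = 0    # how many times we actually computed

    def collect(self):
        if self._cache is None:      # compute once, keep in memory
            self.materializations += 1
            self._cache = list(self._compute())
        return self._cache

parent = CachedDataset(lambda: (x * x for x in range(10)))
evens = [x for x in parent.collect() if x % 2 == 0]   # first query
total = sum(parent.collect())                          # second query, cache hit
# parent.materializations == 1 although two queries ran
```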
13 
Apache Storm 
What Is Storm?
• Storm is a cluster engine executing applications that perform real-time
analysis of streaming data in motion – enabling the Internet of Things for
data such as sensor data, aircraft parts data, traffic analysis, etc.
Storm
• Real-time stream processing at scale
• Storage: none – data in motion
• Runs on Hadoop or on its own cluster
• Fast in-memory processing
vs.
Spark
• Batch, interactive, graph and real-time processing
• Storage: Hadoop HDFS, Amazon S3, Cassandra…
• Runs on many platforms
• Fast in-memory processing
© Talend 2014
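The "data in motion" idea — a topology that runs until you stop it, unlike a finite MapReduce job — can be sketched with Python generators. The spout/bolt names follow Storm's vocabulary, but the code is an invented toy, not the Storm API:

```python
# Toy generator-based analogue of a Storm topology: a "spout" emits an
# unbounded stream of tuples and a "bolt" maintains running counts,
# emitting an updated snapshot per tuple. The stream never ends on its
# own; the consumer decides when to stop.
from collections import Counter
from itertools import cycle, islice

def sensor_spout():
    # stand-in for an endless source (sensors, clickstream, logs, ...)
    yield from cycle(["temp", "pressure", "temp"])

def counting_bolt(stream):
    counts = Counter()
    for event in stream:
        counts[event] += 1
        yield dict(counts)           # emit a snapshot per tuple

# take just the first 6 results from the otherwise-infinite pipeline
snapshots = list(islice(counting_bolt(sensor_spout()), 6))
# after 6 tuples: "temp" seen 4 times, "pressure" 2 times
```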
14 
Quick Reference – Big Data 
Vendors: The Apache Hadoop ecosystem is a collection of many projects. 
Because of the complexities, for-profit companies have packaged, added to, 
enhanced and tried to differentiate themselves in the Hadoop world. The main 
players are: 
- Cloudera – CDH – Cloudera Distribution for Hadoop. Current version is CDH 
5.2 (includes YARN) 
- Hortonworks – HDP – Hortonworks Data Platform. Spun out of Yahoo in 
2011. Current version is HDP 2.2 (YARN) 
- MapR – M3 (Community), M5 (Enterprise), M7 (adds NoSQL). Apache 
Hadoop derivative. Uses its own NFS-accessible file system instead of HDFS. 
- Pivotal – GPHD – Greenplum Hadoop. Spun out of EMC in 2013. Current is 
Pivotal HD 2.0 (YARN) 
© Talend 2014
15 
Quick Reference – NoSQL 
NoSQL: A NoSQL database provides a mechanism for storage and 
retrieval of data that is modeled by means other than the tabular relations 
used in relational databases – e.g. document, graph and columnar databases. 
Excellent comparison of NoSQL databases by Kristof Kovacs: 
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis 
Includes a comparison of: 
- Cassandra 
- MongoDB 
- Riak 
- Couchbase 
- … and many more 
© Talend 2014
16 
Quick Reference – NoSQL 
Document Storage: stores documents that encapsulate and encode data in some 
standard format (including XML, YAML, and JSON, as well as binary forms like BSON, 
PDF and Microsoft Office documents). Different implementations offer different ways of 
organizing and/or grouping documents. 
Documents are addressed in the database via a unique key that represents that 
document. The big feature is that the database offers an API or query language allowing 
retrieval of documents based on their contents. 
CouchDB: Apache database that focuses on embracing the web. Uses JSON to store 
data, JavaScript as its query language (using MapReduce), and HTTP for an API. The 
HTTP API is a differentiator between CouchDB and Couchbase. 
Couchbase: designed to provide key-value or document access. Native JSON support. 
Membase + CouchDB = Couchbase. The Couchbase architecture adds auto-sharding, 
memcached and 100% uptime redundancy over CouchDB alone. Couchbase has a free 
version but is not open-source. 
MongoDB: JSON/BSON style documents with flexible schemas to store data. A 
“collection” is a grouping of MongoDB documents. Collections do not enforce document 
structures. 
© Talend 2014
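The two access patterns above — lookup by unique key, and retrieval by content — can be sketched with plain dicts. The documents and the `find()` helper are hypothetical; real document stores (MongoDB, CouchDB, Couchbase) add indexes, sharding and richer query languages:

```python
# Minimal dict-backed model of a document store: documents addressed by
# a unique key, with flexible per-document schemas, plus content-based
# retrieval (the "big feature" of a document database's query API).

docs = {
    "u1": {"name": "Ada", "city": "London", "tags": ["math"]},
    "u2": {"name": "Alan", "city": "London"},        # schemas may differ
    "u3": {"name": "Grace", "city": "Arlington"},
}

def find(collection, **criteria):
    """Return keys of documents whose fields match all criteria."""
    return sorted(k for k, doc in collection.items()
                  if all(doc.get(field) == value
                         for field, value in criteria.items()))

by_key = docs["u2"]                      # O(1) lookup by unique key
londoners = find(docs, city="London")    # query by content -> ['u1', 'u2']
```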
17 
Quick Reference – NoSQL 
Column Storage: stores data tables as sections of columns of data rather than as 
rows of data. Good for finding or aggregating over large sets of similar data. Column 
storage serializes all data for one column contiguously on disk (so reading a 
column is very fast). The organization of your data really matters in columnar storage. No 
restriction on the number of columns. One row in a relational store may be many rows in 
columnar. 
Cassandra: Apache distributed database designed to handle large amounts of data 
across many commodity servers, providing high availability with no single point of 
failure. 
Dynamo: Amazon NoSQL database service. All data is stored on solid-state drives and replicated 
across three availability zones. Integrated with Amazon EMR and S3. Stores "Items" (collections of 
key-value pairs) given an ID. 
Riak: a distributed, fault-tolerant key-value database. HTTP/REST API. Can walk links (similar to a 
graph). Best used for single-site scalability, availability and fault tolerance – places where even 
seconds of downtime hurt data collection. Good for point-of-sale or factory control-system data 
collection. 
HBase: non-relational data store on top of Hadoop. Think of the column as the key and the data as 
the value. Column families must be created when the table is created. Look on the Advanced tab to 
create families, then use them when writing data. 
© Talend 2014
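The row-versus-column layout difference can be shown side by side with toy data (invented for illustration). The point is that a columnar aggregate reads one contiguous list instead of scanning every full row:

```python
# Row vs. column layout for the same table (a toy sketch): columnar
# storage keeps each column contiguous, so aggregating one field
# touches only that column's data.

rows = [                                  # row-oriented layout
    {"id": 1, "region": "east", "amount": 100},
    {"id": 2, "region": "west", "amount": 250},
    {"id": 3, "region": "east", "amount": 50},
]

columns = {                               # same data, column-oriented
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [100, 250, 50],
}

row_total = sum(r["amount"] for r in rows)   # scans whole rows
col_total = sum(columns["amount"])           # reads one contiguous list
# both give 400; the columnar scan touches a third of the data
```

On disk the gap is far larger: a columnar store reads only the `amount` pages and can compress the similar values in each column aggressively.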
18 
Big Data Integration Landscape 
© Talend 2014
19 
Data-Driven Landscape 
© Talend 2014 
• Hadoop & NoSQL 
• Data Quality 
• Latency & Velocity 
• Expanding Data Volumes 
• Master Data Consistency 
• Lack of Talent / Skills 
• Siloed Data due to SaaS 
• No End-to-End Metadata Visibility
20 
Macro Trends Revolutionizing 
the Integration Market 
© Talend 2014 
• The amount of data will grow 50X from 2010 to 2020 
• 64% of enterprises surveyed indicate that they're deploying or planning Big Data projects 
• By 2020, 55% of CIOs will source all their critical apps in the Cloud 
Source: Gartner and Cisco reports
21 
The New Data Integration Economics 
“Big data is what 
happened when the cost 
of keeping information 
became less than the 
cost of throwing it away.” 
– Technology Historian George Dyson 
© Talend 2014 
• 45x savings: $1,000/TB for Hadoop vs $45,000/TB for traditional storage 
• $600B revenue shift by 2020 to companies that use big data effectively 
• 6x faster ROI using big data analytics tools vs a traditional EDW 
• 600x active data: Neustar moved from storing 1% of data for 60 days to 100% for one year
22 
Existing Infrastructures Under Distress: 
Architecturally and Economically 
© Talend 2014 
[Diagram: the classic data warehouse – legacy systems, relational systems/ERP and
external data sources are transformed (with metadata) into data marts (the data
warehouse), which feed standard reports, ad-hoc query tools, data mining, MDD/OLAP
and analytical applications. Pressures: weblogs, the shift from batch to real-time, the
data explosion, and the need for more active data.]
23 
Benefits of Hadoop and NoSQL 
© Talend 2014 
[Diagram: Hadoop and NoSQL absorb the pressure points – the data explosion,
batch to real-time, longer active data – taking in IoT, web logs and NoSQL sources
alongside ERP, DBMS/EDW and legacy systems, while data marts (the data
warehouse) continue to feed standard reports, ad-hoc query tools, data mining,
MDD/OLAP and analytical applications.]
24 
Top Big Data Challenges 
© Talend 2014 
Source: Gartner - Survey Analysis: Big Data Adoption in 2013 Shows Substance Behind 
the Hype - 12 September 2013 - G00255160 
“How To” 
Challenges
25 
Big Data Integration Capabilities 
© Talend 2014
26 
Top Big Data Challenges 
© Talend 2014 
Need Solutions that 
Address these 
Challenges 
Source: Gartner - Survey Analysis: Big Data Adoption in 2013 Shows Substance 
Behind the Hype - 12 September 2013 - G00255160
27 
Convergence, Big Data & Consumerization 
• Next-gen integration platforms need to be designed 
& architected with big data requirements in mind 
© Talend 2014 
ETL / ELT – Parallelization – RDBMS 
• Processing needs to be distributed & flexible 
• Big data technologies need to be integrated seamlessly with existing 
integration investments 
28 
Big Data Integration Landscape 
© Talend 2014
29 
“I may say that this is the greatest 
factor: the way in which the 
expedition is equipped.” 
© Talend 2014 
Roald Amundsen 
race to the south pole, 1911 
Source of Roald Amundsen portrait: 
Norwegian National Library 
30 
Big Data Integration: Ingest – Transform – Deliver 
© Talend 2014 
[Diagram: INGEST (ingestion: Sqoop, Flume, HDFS API, HBase API) feeds
TRANSFORM (data refinement: map, profile, parse, cleanse, standardize, match,
CDC, machine learning – with Hive over the YARN engines: batch MapReduce,
interactive Tez, streaming Storm/Spark, graph Giraph, NoSQL MongoDB, events
Falcon, online HBase, search – all on YARN cluster resource management and
redundant, reliable HDFS2 storage), which then DELIVERs as an API (CXF, Camel,
Karaf, ActiveMQ, Storm, Kafka), with iPaaS, MDM, HA, governance, security and
metadata around the stack.]
31 
Big Data Integration and Processing 
© Talend 2014 
[Diagram: data from various source systems is loaded to HDFS by the big data
integration layer, processed in Hadoop (HDFS, Map/Reduce, Hive), and federated
to an analytics dashboard.]
32 
Important Objectives 
• Moving from hand-code to code generation – MapReduce, 
Pig, Hive, SQOOP etc. – using a graphical user interface 
• Zero footprint on the Hadoop cluster 
• Same graphical user interface for both standard data 
integration and Big Data integration 
© Talend 2014
33 
Trying to get from this… 
© Talend 2014
34 
…to this: "pure Hadoop" and MapReduce 
Visually design in MapReduce and optimize before 
deploying on Hadoop 
© Talend 2014 
35 
Native Map/Reduce Jobs 
• Create graphical ETL patterns using native Map/Reduce 
© Talend 2014 
• Reduce the need for big 
data coding skills 
• Zero pre-installation on 
the Hadoop cluster 
• Hadoop is the “engine” 
for data processing
36 
Other Important Objectives 
• Enables organizations to leverage existing skills such as 
Java and other open-source languages 
• A large collaborative community for support 
• A large number of components for data and applications, including big data 
and NoSQL 
• Works directly on the Apache Hadoop API 
• Native support for YARN and Hadoop 2.0 for better resource 
optimization 
• Software created through open standards and development 
processes that eliminate vendor lock-in 
• Scalability, portability and performance come for "free" due to Hadoop 
© Talend 2014 
37 
© Talend 2014 
Talend Solution for Big Data Integration
38 
Talend’s Solution 
© Talend 2014
39 
The Value of Talend for Big Data 
Leverage In-house Resources 
© Talend 2014 
- Easy-to-use, familiar Eclipse-based tools that generate big data code 
- 100% standards-based, open source 
- Lots of examples with a large collaborative community 
Big Data Ready 
- Native support for Hadoop, MapReduce, and NoSQL 
- 800+ connectors to all data sources 
- Built-in data quality, security and governance (Platform for Big Data) 
Lower Costs 
- A predictable and scalable subscription model 
- Based only on users (not CPUs or connectors) 
- Free to download, no runtimes to install on your cluster 
40 
Talend’s Value for Big Data 
• New frameworks like Spark and Storm are emerging on 
Hadoop and can run on other platforms 
• Companies want to accelerate big data processing and run 
more sophisticated workloads by exploiting in-memory 
capabilities via Spark and by analyzing real-time data in 
motion via Storm 
• Talend can generate Storm applications to analyze and 
filter data in real-time as well as use source data filtered 
by Storm applications 
• Talend can help customers rapidly exploit new Big Data 
technologies to reduce time to value while insulating them 
from future extensions and advancements 
© Talend 2014
41 
Thank You For Your Participation 
© Talend 2014

More Related Content

What's hot

Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
DataWorks Summit
 

What's hot (20)

ETL using Big Data Talend
ETL using Big Data Talend  ETL using Big Data Talend
ETL using Big Data Talend
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionThe DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to Production
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoop
 
Manipulating Data with Talend.
Manipulating Data with Talend.Manipulating Data with Talend.
Manipulating Data with Talend.
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Splice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakesSplice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakes
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
What's new in Ambari
What's new in AmbariWhat's new in Ambari
What's new in Ambari
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Ingesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache OrcIngesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache Orc
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
HDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and SupportabilityHDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and Supportability
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
 
Apache hive essentials
Apache hive essentialsApache hive essentials
Apache hive essentials
 
The Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewThe Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture View
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive Industry
 

Viewers also liked

Viewers also liked (12)

Essential Tools For Your Big Data Arsenal
Essential Tools For Your Big Data ArsenalEssential Tools For Your Big Data Arsenal
Essential Tools For Your Big Data Arsenal
 
TOUG Big Data Challenge and Impact
TOUG Big Data Challenge and ImpactTOUG Big Data Challenge and Impact
TOUG Big Data Challenge and Impact
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Simplifying Big Data ETL with Talend
Simplifying Big Data ETL with TalendSimplifying Big Data ETL with Talend
Simplifying Big Data ETL with Talend
 
Big data: Bringing competition policy to the digital era – Background note – ...
Big data: Bringing competition policy to the digital era – Background note – ...Big data: Bringing competition policy to the digital era – Background note – ...
Big data: Bringing competition policy to the digital era – Background note – ...
 
Talend Big Data Capabilities Overview
Talend Big Data Capabilities OverviewTalend Big Data Capabilities Overview
Talend Big Data Capabilities Overview
 
QlikView & Big Data
QlikView & Big DataQlikView & Big Data
QlikView & Big Data
 
Open Source ETL using Talend Open Studio
Open Source ETL using Talend Open StudioOpen Source ETL using Talend Open Studio
Open Source ETL using Talend Open Studio
 
Big Data Industry Insights 2015
Big Data Industry Insights 2015 Big Data Industry Insights 2015
Big Data Industry Insights 2015
 
Big Data Trends
Big Data TrendsBig Data Trends
Big Data Trends
 
Big Data and Advanced Analytics
Big Data and Advanced AnalyticsBig Data and Advanced Analytics
Big Data and Advanced Analytics
 
Big Data
Big DataBig Data
Big Data
 

Similar to Big Data Hoopla Simplified - TDWI Memphis 2014

Similar to Big Data Hoopla Simplified - TDWI Memphis 2014 (20)

Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoop
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
In15orlesss hadoop
In15orlesss hadoopIn15orlesss hadoop
In15orlesss hadoop
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a Nutshell
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 

Recently uploaded

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Recently uploaded (20)

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 

Big Data Hoopla Simplified - TDWI Memphis 2014

  • 1. 1 Big Data Hoopla Simplified – Hadoop, MapReduce, NoSQL … TDWI Conference – Memphis, TN Oct 29, 2014 © Talend 2014
  • 2. 2 About the Presenter Rajan Kanitkar • Senior Solutions Engineer • Rajan Kanitkar is a Pre-Sales Consultant with Talend. He has been active in the broader Data Integration space for the past 15 years and has experience with several leading software companies in these areas. His areas of specialties at Talend include Data Integration (DI), Big Data (BD), Data Quality (DQ) and Master Data Management (MDM). • Contact: rkanitkar@talend.com © Talend 2014
  • 3. 3 Big Data Ecosystem © Talend 2014
  • 4. 4 Quick Reference – Big Data Hadoop: Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop v1.0 - Original version that focused on HDFS and MapReduce. The Resource Manager and Job Tracker were one entity. Hadoop v2.0 – Sometimes called MapReduce 2 (MRv2). Splits out the Resource Manager and job monitoring into two separate daemons. Also called YARN. This new architecture allows for other processing engines to be managed/monitored aside from just the MapReduce engine. © Talend 2014
  • 5. 5 Quick Reference - Big Data • Hadoop: the core project • HDFS: the Hadoop Distributed File System • MapReduce: the software framework for distributed processing of large data sets • Hive: a data warehouse infrastructure that provides data summarization and a querying language • Pig: a high-level data-flow language and execution framework for parallel computation • HBase: the Hadoop database. Use it when you need random, real-time read/write access to your Big Data • And many, many more: Sqoop, HCatalog, ZooKeeper, Oozie, Cassandra, MongoDB, etc. © Talend 2014
  • 6. 6 Hadoop Core – HDFS Metadata Operations Name Node Client Data Node © Talend 2014 Block Block Block Block Data Node Block Block Block Block Data Node Block Block Block Block Data Node Block Block Block Block Read/Write Control Replicate
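The HDFS slide above (client, Name Node metadata, Data Nodes holding replicated blocks) can be sketched as a toy simulation. This is not HDFS code; the function names, the 8-byte block size and the round-robin placement are invented for illustration (real HDFS defaults to 128 MB blocks, replication factor 3, and rack-aware placement):

```python
# Toy sketch of HDFS-style storage: the client splits a file into
# fixed-size blocks; the Name Node records which Data Nodes hold each
# block's replicas. All names and sizes here are illustrative only.
BLOCK_SIZE = 8        # bytes, tiny on purpose; real HDFS uses 128 MB
REPLICATION = 3       # real HDFS default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split file contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, data_nodes, replication=REPLICATION):
    """Record, per block index, which Data Nodes hold a replica."""
    placement = {}
    for i, _ in enumerate(blocks):
        # simple round-robin; real HDFS placement is rack-aware
        placement[i] = [data_nodes[(i + r) % len(data_nodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(b"some file content to store")
placement = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
# 26 bytes / 8-byte blocks -> 4 blocks, each on 3 of the 4 Data Nodes
```

The point of the sketch: losing any single Data Node never loses a block, because every block lives on `REPLICATION` nodes.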
  • 7. 7 Hadoop Core – MapReduce © Talend 2014 The "Word Count" Example
  • 8. 8 Quick Reference – Data Services HCatalog: a set of interfaces that open up access to Hive's metastore for tools inside and outside of the Hadoop grid. Hortonworks donated it to Apache; in March 2013 it merged with Hive. Enables users with different processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the cluster. HBase: a non-relational, distributed database modeled after Google's Bigtable. Good at storing sparse data. Considered a key-value columnar database. Runs on top of HDFS. Useful for random real-time read/write access. Hive: a data warehouse infrastructure built on top of Hadoop. Provides data summarization, ad-hoc query, and analysis of large datasets. Lets you query data using a SQL-like language called HiveQL (HQL). Mahout: a library of scalable machine-learning algorithms implemented on top of Hadoop. Mahout supports collaborative filtering, clustering, classification and frequent itemset mining. Pig: lets you write complex MapReduce transformations using the Pig Latin scripting language. Pig Latin defines a set of transformations such as aggregate, join and sort. Pig translates the Pig Latin script into MapReduce so that it can be executed within Hadoop. Sqoop: a utility for bulk data import/export between HDFS and structured data stores such as relational databases. © Talend 2014
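To make Hive's role concrete: the user writes SQL-like HiveQL, and Hive compiles it into MapReduce jobs. The snippet below shows an illustrative HiveQL query (the `employees` table and its rows are made up) next to a plain-Python simulation of the same GROUP BY aggregation:

```python
from collections import Counter

# What a Hive user writes; Hive turns this into MapReduce under the hood.
# Table and column names here are hypothetical.
HQL = "SELECT dept, COUNT(*) FROM employees GROUP BY dept"

# The same aggregation, simulated over a toy in-memory table:
employees = [
    {"name": "ann", "dept": "eng"},
    {"name": "bob", "dept": "eng"},
    {"name": "cho", "dept": "ops"},
]
dept_counts = Counter(row["dept"] for row in employees)
# dept_counts == {"eng": 2, "ops": 1}
```

Pig expresses the same pipeline imperatively (LOAD, GROUP, FOREACH … GENERATE COUNT) rather than declaratively; both compile down to the word-count-style map/shuffle/reduce pattern shown earlier.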
  • 9. 9 Quick Reference – Operational Services Oozie: an Apache workflow scheduler for Hadoop. It allows for coordination between Hadoop jobs. A workflow in Oozie is defined as a Directed Acyclic Graph (DAG). ZooKeeper: a distributed, highly available coordination service. Allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (called znodes). Writing distributed applications is hard, primarily because of partial failure. ZooKeeper gives you a set of tools to build distributed applications that can safely handle partial failures. Kerberos: a computer network authentication protocol which provides mutual authentication. The name comes from the three-headed dog of Greek mythology. The three heads of Kerberos are 1) the Key Distribution Center (KDC), 2) the client user, and 3) the server with the desired service to access. The KDC performs two service functions: Authentication (are you who you say you are) and Ticket-Granting (gives you an expiring ticket that grants access to certain resources). A Kerberos principal is a unique identity to which Kerberos can assign tickets (like a username). A keytab is a file containing pairs of Kerberos principals and encrypted keys (these are derived from the Kerberos password). © Talend 2014
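Oozie's DAG model boils down to: each job runs only after all of its predecessors have finished. A minimal sketch using the standard library (the job names below are invented; real Oozie workflows are XML definitions, not Python):

```python
# Sketch of Oozie-style DAG scheduling with Python's stdlib topological
# sorter (Python 3.9+). Each key maps a job to the jobs it depends on.
from graphlib import TopologicalSorter

workflow = {
    "load-hdfs": [],                                   # no dependencies
    "hive-summarize": ["load-hdfs"],
    "pig-cleanse": ["load-hdfs"],
    "export-sqoop": ["hive-summarize", "pig-cleanse"], # fan-in
}

# static_order() yields every predecessor before its dependents,
# which is exactly the order a workflow engine may launch the jobs in.
order = list(TopologicalSorter(workflow).static_order())
```

Because the graph must be acyclic, a cycle (job A waiting on B waiting on A) is rejected up front; that is what the "acyclic" in DAG buys a scheduler.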
  • 10. 10 MapReduce 2.0, YARN, Storm, Spark • YARN: Ensures predictable performance & QoS for all apps • Enables apps to run “IN” Hadoop rather than “ON” • Streaming with Apache Storm • Mini-Batch and In-Memory with Apache Spark © Talend 2014 Applications Run Natively IN Hadoop YARN (Cluster Resource Management) HDFS2 (Redundant, Reliable Storage) BATCH (MapReduce) INTERACTIVE (Tez) STREAMING (Storm, Spark) GRAPH (Giraph) NoSQL (MongoDB) EVENTS (Falcon) ONLINE (HBase) OTHER (Search) Source: Hortonworks
  • 11. 11 Quick Reference – Hadoop 2.0 Additions Storm: a distributed real-time computation system. A Storm cluster is similar to a Hadoop cluster, but where Hadoop runs "MapReduce jobs", Storm runs "topologies". The two are very different: a MapReduce job eventually finishes, while a topology processes messages forever (or until you kill it). Storm can run on top of YARN. Spark: a parallel computing engine which can operate over any Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. Holds intermediate results in memory rather than writing them to disk, which drastically reduces query return time. Like a Hadoop cluster, but supports more than just MapReduce. Tez: a framework that models data processing as a complex directed acyclic graph of tasks, built atop Apache Hadoop YARN. MapReduce is batch-oriented and unsuited for interactive query; Tez allows Hive and Pig to serve interactive queries at petabyte scale. Also adds support for machine learning. © Talend 2014
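The jobs-versus-topologies distinction can be shown with a toy contrast: a batch job consumes a finite input and terminates with one answer; a streaming topology emits a result per tuple as it arrives. A Python generator stands in for the message stream here (all names are ours, not Storm's API):

```python
# Toy contrast between batch (MapReduce-style) and streaming
# (Storm-style) processing. Purely illustrative, not Storm code.

def batch_job(records):
    # Batch: read the whole finite input, produce one result, finish.
    return sum(records)

def streaming_topology(stream):
    # Streaming: update and emit a result per incoming tuple; this loop
    # would never end on a real, unbounded stream.
    total = 0
    for record in stream:
        total += record
        yield total          # a downstream "bolt" would consume this

final = batch_job([1, 2, 3])                         # finishes: 6
running = list(streaming_topology(iter([1, 2, 3])))  # per-tuple: [1, 3, 6]
```

The streaming version gives an answer after every message, which is why Storm suits sensor feeds and traffic analysis where waiting for a batch window is too slow.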
  • 12. 12 Apache Spark What is Spark? • Spark is an in-memory cluster computing engine that includes an HDFS-compatible in-memory file system. Hadoop MapReduce • Batch processing at scale • Storage: Hadoop HDFS • Runs on Hadoop © Talend 2014 VS Spark • Batch, interactive, graph and real-time processing • Storage: – Hadoop HDFS, Amazon S3, Cassandra… • Runs on many platforms • Fast in-memory processing, up to 100x faster than MapReduce (M/R)
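Why does keeping intermediate results in memory matter so much? A small sketch: `functools.lru_cache` stands in for Spark's dataset caching, so repeated queries reuse the loaded data instead of re-reading it from disk each time (the function and counter names are invented for the demo):

```python
# Sketch of in-memory caching of intermediate results. lru_cache plays
# the role of Spark's cache; the "expensive disk scan" is simulated.
from functools import lru_cache

calls = {"loads": 0}

@lru_cache(maxsize=None)
def load_and_clean():
    calls["loads"] += 1                  # pretend: a slow disk/HDFS scan
    return tuple(x * 2 for x in range(5))

q1 = sum(load_and_clean())   # first query pays the load cost
q2 = max(load_and_clean())   # later queries hit the in-memory copy
# calls["loads"] stays at 1: the data was loaded exactly once
```

MapReduce, by contrast, writes every intermediate result back to HDFS, so each stage of a multi-pass or iterative workload pays the disk round trip again; that gap is the source of Spark's speed-up claims.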
  • 13. 13 Apache Storm What Is Storm? • Storm is a cluster engine executing applications that perform real-time analysis of streaming data in motion – enabling the Internet of Things for data such as sensor data, aircraft parts data, traffic analysis, etc. Storm • Real-time stream processing at scale • Storage: None - Data in Motion • Runs on Hadoop or on its own cluster • Fast in-memory processing © Talend 2014 VS Spark • Batch, interactive, graph and real-time processing • Storage: – Hadoop HDFS, Amazon S3, Cassandra… • Runs on many platforms • Fast in-memory processing
  • 14. 14 Quick Reference – Big Data Vendors: The Apache Hadoop ecosystem is a collection of many projects. Because of the complexities, for-profit companies have packaged, added, enhanced and tried to differentiate themselves in the Hadoop world. The main players are: - Cloudera – CDH – Cloudera Distribution for Hadoop. Current version is CDH 5.2 (includes YARN) - Hortonworks - HDP – Hortonworks Data Platform. Spun out of Yahoo in 2011. Current version is HDP 2.2 (YARN) - MapR – M3 (Community), M5 (Enterprise), M7 (adds NoSQL). Apache Hadoop derivative. Uses its own NFS-accessible file system instead of HDFS. - Pivotal - GPHD – Greenplum Hadoop. Spun out of EMC in 2013. Current is Pivotal HD 2.0 (YARN) © Talend 2014
  • 15. 15 Quick Reference – NoSQL NoSQL: A NoSQL database provides a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases – document, graph and columnar databases. Excellent comparison of NoSQL databases by Kristof Kovacs: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis Includes a comparison of: - Cassandra - MongoDB - Riak - Couchbase - … and many more © Talend 2014
  • 16. 16 Quick Reference – NoSQL Document Storage: stores documents that encapsulate and encode data in some standard format (including XML, YAML, and JSON, as well as binary forms like BSON, PDF and Microsoft Office documents). Different implementations offer different ways of organizing and/or grouping documents. Documents are addressed in the database via a unique key that represents that document. The big feature is that the database offers an API or query language allowing retrieval of documents based on their contents. CouchDB: an Apache database that focuses on embracing the web. Uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API. The HTTP API is a differentiator between CouchDB and Couchbase. Couchbase: designed to provide key-value or document access. Native JSON support. Membase + CouchDB = Couchbase. The Couchbase architecture adds auto-sharding, memcached and 100% uptime redundancy over CouchDB alone. Couchbase has a free version but is not open source. MongoDB: JSON/BSON-style documents with flexible schemas to store data. A "collection" is a grouping of MongoDB documents. Collections do not enforce document structures. © Talend 2014
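The two document-store ideas on this slide – address a document by a unique key, and query by document contents – fit in a tiny sketch. A Python dict stands in for the store and `json` for the document encoding; the API names and sample documents are invented, not any real database's client library:

```python
# Toy document store: put/get by unique key, plus a content-based query.
# Illustrative only; real stores (MongoDB, CouchDB) add indexes,
# sharding and replication on top of this idea.
import json

store = {}

def put(doc_id, doc):
    store[doc_id] = json.dumps(doc)      # persist the document as JSON

def get(doc_id):
    return json.loads(store[doc_id])     # address by unique key

def find(predicate):
    # Query by contents - the "big feature" of document databases
    return [json.loads(d) for d in store.values()
            if predicate(json.loads(d))]

put("u1", {"name": "ann", "city": "memphis"})
put("u2", {"name": "bob", "city": "nashville"})
memphians = find(lambda doc: doc["city"] == "memphis")
```

Note that the two documents need not share a schema, which is the "flexible schemas" point made about MongoDB collections above.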
  • 17. 17 Quick Reference – NoSQL Column Storage: stores data tables as sections of columns of data rather than as rows of data. Good for finding or aggregating over large sets of similar data. Column storage serializes all data for one column contiguously on disk (so reading a column is very fast). The organization of your data REALLY matters in columnar storage. No restriction on the number of columns. One row in a relational store may be many rows in a columnar store. Cassandra: an Apache distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. DynamoDB: Amazon's NoSQL database service. All data is stored on solid-state drives and replicated across three availability zones. Integrated with Amazon EMR and S3. Stores "Items" (collections of key-value pairs) given an ID. Riak: a distributed fault-tolerant key-value database. HTTP/REST API. Can walk links (similar to a graph). Best used for single-site scalability, availability and fault tolerance – places where even seconds of downtime hurt data collection. Good for point-of-sale or factory control system data collection. HBase: non-relational data store on top of Hadoop. Think of the column as key and the data as value. Column families must be created at table-create time. Look on the Advanced tab to create families, then use them when writing data. © Talend 2014
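The row-versus-column layout trade-off can be shown with the same toy table stored both ways: aggregating one column forces the row store to touch every whole tuple, while the column store scans a single contiguous list (the table and its values are made up for the demo):

```python
# Same table, two physical layouts. Aggregating "age" touches every
# row-tuple in the row layout, but only one contiguous list in the
# column layout - the reason columnar stores excel at analytics.
rows = [("ann", 30, "eng"), ("bob", 25, "ops"), ("cho", 35, "eng")]

columns = {
    "name": ["ann", "bob", "cho"],
    "age":  [30, 25, 35],
    "dept": ["eng", "ops", "eng"],
}

avg_age_rowwise = sum(r[1] for r in rows) / len(rows)        # scans tuples
avg_age_colwise = sum(columns["age"]) / len(columns["age"])  # scans 1 column
# both averages == 30.0; only the I/O pattern differs
```

In memory the difference is invisible; on disk, where the slide notes a column is serialized contiguously, the columnar scan reads a fraction of the bytes, and same-typed values compress far better.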
  • 18. 18 Big Data Integration Landscape © Talend 2014
  • 19. 19 Data-Driven Landscape © Talend 2014 Hadoop & NoSQL Data Quality Latency & Velocity Expanding Data Volumes Master Data Consistency Lack of Talent / Skills Siloed Data due to SaaS No End-to-End Metadata Visibility
  • 20. 20 Macro Trends Revolutionizing the Integration Market © Talend 2014 20 The amount of data will grow 50X from 2010 to 2020 64% of enterprises surveyed indicate that they’re deploying or planning Big Data projects By 2020, 55% of CIOs will source all their critical apps in the Cloud Source: Gartner and Cisco reports
  • 21. 21 The New Data Integration Economics “Big data is what happened when the cost of keeping information became less than the cost of throwing it away.” – Technology Historian George Dyson © Talend 2014 45x savings. $1,000/TB for Hadoop vs $45,000/TB for traditional $600B revenue shift by 2020 to companies that use big data effectively 6x faster ROI using big data analytics tools vs traditional EDW 600x active data. Neustar moved from storing 1% of data for 60 days to 100% for one year
  • 22. 22 Existing Infrastructures Under Distress: Architecturally and Economically [Architecture diagram: legacy systems, relational systems/ERP, weblogs and external data sources feed, via a transform/metadata layer, data marts (the data warehouse) and analytical applications – standard reports, ad-hoc query tools, data mining, MDD/OLAP – strained by the data explosion, the shift from batch to real-time, and the need for more active data] © Talend 2014
  • 23. 23 Benefits of Hadoop and NoSQL [Architecture diagram: the same warehouse pipeline – legacy systems, ERP, DBMS/EDW, web logs, IoT feeds, data marts (the data warehouse) and analytical applications (standard reports, ad-hoc query tools, data mining, MDD/OLAP) – with Hadoop and NoSQL added to absorb the data explosion, batch-to-real-time demands, and longer active data retention] © Talend 2014
  • 24. 24 Top Big Data Challenges © Talend 2014 Source: Gartner - Survey Analysis: Big Data Adoption in 2013 Shows Substance Behind the Hype - 12 September 2013 - G00255160 “How To” Challenges
  • 25. 25 Big Data Integration Capabilities © Talend 2014
  • 26. 26 Top Big Data Challenges © Talend 2014 Need Solutions that Address these Challenges Source: Gartner - Survey Analysis: Big Data Adoption in 2013 Shows Substance Behind the Hype - 12 September 2013 - G00255160
  • 27. 27 Convergence, Big Data & Consumerization • Next-gen integration platforms need to be designed & architected with big data requirements in mind © Talend 2014 ETL / ELT Parallelization Processing needs to be distributed & flexible Big data technologies need to be integrated seamlessly with existing integration investments RDBMS
  • 28. 28 Big Data Integration Landscape © Talend 2014
  • 29. 29 “I may say that this is the greatest factor: the way in which the expedition is equipped.” Roald Amundsen, Race to the South Pole, 1911. Source of Roald Amundsen portrait: Norwegian National Library © Talend 2014
  • 30. 30 Big Data Integration: Ingest – Transform – Deliver © Talend 2014 iPaaS MDM HA Govern Security Meta Storm Kafka CXF Camel YARN (Cluster Resource Management) HDFS2 (Redundant, Reliable Storage) HIVE BATCH (MapReduce) INTERACTIVE (Tez) STREAMING (Storm, Spark) GRAPH (Giraph) NoSQL (MongoDB) EVENTS (Falcon) ONLINE (HBase) OTHER (Search) TRANSFORM (Data Refinement) MAP PROFILE PARSE CLEANSE CDC STANDARDIZE MACHINE LEARNING MATCH INGEST (Ingestion) SQOOP FLUME HDFS API HBase API DELIVER (as an API) Karaf ActiveMQ
  • 31. 31 Big Data Integration and Processing © Talend 2014 Analytics Dashboard Load to HDFS BIG DATA (Integration) Federate to analytics HDFS Map/Reduce HADOOP Data from Various Source Systems Hive
  • 32. 32 Important Objectives • Moving from hand-coding to code generation – MapReduce, Pig, Hive, Sqoop, etc. – using a graphical user interface • Zero footprint on the Hadoop cluster • Same graphical user interface for both standard data integration and Big Data integration © Talend 2014
  • 33. 33 Trying to get from this… © Talend 2014
  • 34. 34 “pure Hadoop” and MapReduce © Talend 2014 Visual design in MapReduce and optimize before deploying on Hadoop to this…
  • 35. 35 Native Map/Reduce Jobs • Create graphical ETL patterns using native Map/Reduce © Talend 2014 • Reduce the need for big data coding skills • Zero pre-installation on the Hadoop cluster • Hadoop is the “engine” for data processing
  • 36. 36 Other Important Objectives Enables organizations to leverage existing skills such as Java and other open source languages A large collaborative community for support A large number of components for data and applications including big data and NoSQL Works directly on the Apache Hadoop API Native support for YARN and Hadoop 2.0 for better resource optimization Software created through open standards and development processes that eliminate vendor lock-in Scalability, portability and performance come for “free” due to Hadoop © Talend 2014
  • 37. 37 © Talend 2014 Talend Solution for Big Data Integration
  • 38. 38 Talend’s Solution © Talend 2014
  • 39. 39 The Value of Talend for Big Data Leverage In-house Resources © Talend 2014 - Easy-to-use, familiar Eclipse-based tools that generate big data code - 100% standards-based, open source - Lots of examples with a large collaborative community Big Data Ready - Native support for Hadoop, MapReduce, and NoSQL - 800+ connectors to all data sources - Built-in data quality, security and governance (Platform for Big Data) Lower Costs - A predictable and scalable subscription model - Based only on users (not CPUs or connectors) - Free to download, no runtimes to install on your cluster $
  • 40. 40 Talend’s Value for Big Data • New frameworks like Spark and Storm are emerging on Hadoop and can run on other platforms • Companies want to accelerate big data processing and do more sophisticated workloads by exploiting in-memory capabilities via Spark and for analyzing real-time data in motion via Storm • Talend can generate Storm applications to analyze and filter data in real-time as well as use source data filtered by Storm applications • Talend can help customers rapidly exploit new Big Data technologies to reduce time to value while insulating them from future extensions and advancements © Talend 2014
  • 41. 41 Thank You For Your Participation © Talend 2014