1. 1
Big Data Hoopla Simplified –
Hadoop, MapReduce, NoSQL …
TDWI Conference – Memphis, TN
Oct 29, 2014
2. 2
About the Presenter
Rajan Kanitkar
• Senior Solutions Engineer
• Rajan Kanitkar is a Pre-Sales Consultant with Talend. He
has been active in the broader data integration space for
the past 15 years and has experience with several leading
software companies in these areas. His areas of specialty
at Talend include Data Integration (DI), Big Data (BD),
Data Quality (DQ) and Master Data Management (MDM).
• Contact: rkanitkar@talend.com
4. 4
Quick Reference – Big Data
Hadoop: Apache Hadoop is an open-source software framework for storage and large-scale
processing of data sets on clusters of commodity hardware.
Hadoop v1.0 – The original version, focused on HDFS and MapReduce. Resource
management and job scheduling/monitoring were handled by a single entity, the JobTracker.
Hadoop v2.0 – Sometimes called MapReduce 2 (MRv2) or YARN. Splits resource
management and job scheduling/monitoring into separate daemons. This new
architecture allows processing engines other than MapReduce to be managed and
monitored on the same cluster.
5. 5
Quick Reference - Big Data
• Hadoop: the core project
• HDFS: the Hadoop Distributed File System
• MapReduce: the software framework for distributed
processing of large data sets
• Hive: a data warehouse infrastructure that provides data
summarization and a SQL-like query language
• Pig: a high-level data-flow language and execution
framework for parallel computation
• HBase: the Hadoop database; use it when you need
random, real-time read/write access to your Big Data
• And many, many more: Sqoop, HCatalog, ZooKeeper,
Oozie, Cassandra, MongoDB, etc.
6. 6
Hadoop Core – HDFS
[Diagram: HDFS architecture. A client issues metadata operations to the Name Node and reads/writes blocks directly on the Data Nodes; each Data Node stores multiple blocks, which are replicated across Data Nodes under the Name Node's control.]
7. 7
Hadoop Core – MapReduce
The “Word Count” example
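The figure itself does not survive the transcript, but the idea is easy to sketch. Below is a minimal Python word count for Hadoop Streaming (an illustrative stand-in: the canonical example ships with Hadoop in Java; file and path names here are made up). The map step emits (word, 1) pairs, the framework sorts and groups them by key, and the reduce step sums each group:

```python
#!/usr/bin/env python
# Minimal word-count mapper/reducer for Hadoop Streaming (illustrative;
# the canonical example ships with Hadoop in Java). Example invocation:
#   hadoop jar hadoop-streaming.jar \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -input /in -output /out
import sys

def mapper():
    # Map: emit one (word, 1) pair per word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    # Reduce: the framework sorts by key, so all counts for a given
    # word arrive contiguously and can be summed in a single pass.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```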
8. 8
Quick Reference – Data Services
HCatalog: a set of interfaces that open up access to Hive's metastore for tools inside and outside of the
Hadoop grid. Hortonworks donated it to Apache, and in March 2013 it was merged into the Hive project.
Enables users of different processing tools – Pig, MapReduce, and Hive – to more easily read and write
data on the cluster.
HBase: a non-relational, distributed database modeled after Google's Bigtable. Good at storing sparse
data. Considered a key-value columnar database. Runs on top of HDFS. Useful for random, real-time
read/write access.
Hive: a data warehouse infrastructure built on top of Hadoop. Provides data summarization, ad-hoc
query, and analysis of large datasets. Lets users query data with a SQL-like language called HiveQL
(HQL); see the sketch at the end of this list.
Mahout: a library of scalable machine-learning algorithms, implemented on top of Hadoop. Mahout
supports collaborative filtering, clustering, classification, and frequent itemset mining.
Pig: allows you to write complex MapReduce transformations in the Pig Latin scripting language. Pig
Latin defines a set of transformations such as aggregate, join, and sort; Pig translates the Pig Latin script
into MapReduce so that it can be executed within Hadoop.
Sqoop: a utility for bulk data import/export between HDFS and structured data stores such as relational
databases.
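To make HiveQL concrete, here is a short Python sketch using the PyHive client (an assumed dependency; the host name and weblogs table are invented). Hive compiles the SQL-like query below into MapReduce jobs on the cluster:

```python
# A HiveQL query issued from Python via the PyHive client
# (an assumed dependency; host, port, and the weblogs table
# are invented for illustration).
from pyhive import hive

conn = hive.Connection(host="hadoop-edge.example.com", port=10000)
cursor = conn.cursor()
# HiveQL looks like SQL; Hive compiles it into MapReduce jobs.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM weblogs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cursor.fetchall():
    print(page, hits)
```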
9. 9
Quick Reference – Operational Services
Oozie: the Apache workflow scheduler for Hadoop. It allows for coordination between Hadoop jobs; a workflow in
Oozie is defined as a Directed Acyclic Graph (DAG).
ZooKeeper: a distributed, highly available coordination service. It lets distributed processes coordinate with
each other through a shared hierarchical namespace of data registers called znodes (see the sketch below).
Writing distributed applications is hard, primarily because of partial failure; ZooKeeper gives you a set of tools
to build distributed applications that can safely handle partial failures.
Kerberos: a computer network authentication protocol that provides mutual authentication. The name comes
from the mythological three-headed dog. The three heads of Kerberos are 1) the Key Distribution Center (KDC), 2) the
client user, and 3) the server with the desired service to access. The KDC performs two service functions: Authentication
(are you who you say you are?) and Ticket-Granting (issues an expiring ticket that gives you access to certain resources). A
Kerberos principal is a unique identity to which Kerberos can assign tickets (like a username). A keytab is a file
containing pairs of Kerberos principals and encrypted keys (derived from the Kerberos password).
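As a sketch of the znode model described above, here is a minimal read/write example using the kazoo Python client (an assumed library choice; the host name is made up):

```python
# Minimal znode read/write sketch using the kazoo client library
# (an assumed library choice; the host name is made up).
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

# znodes form a shared hierarchical namespace of data registers.
zk.ensure_path("/app/config")
zk.set("/app/config", b"max_workers=8")

data, stat = zk.get("/app/config")
print(data.decode(), "version:", stat.version)
zk.stop()
```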
10. 10
MapReduce 2.0, YARN, Storm, Spark
• YARN: ensures predictable performance & QoS for all apps
• Enables apps to run “IN” Hadoop rather than “ON” it
• Streaming with Apache Storm
• Mini-batch and in-memory with Apache Spark
[Diagram: applications run natively IN Hadoop. YARN (cluster resource management) sits over HDFS2 (redundant, reliable storage) and hosts multiple engines side by side: BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, Spark), GRAPH (Giraph), NoSQL (MongoDB), EVENTS (Falcon), ONLINE (HBase), and OTHER (Search). Source: Hortonworks]
11. 11
Quick Reference – Hadoop 2.0 Additions
Storm: a distributed real-time computation system. A Storm cluster is similar to a Hadoop cluster,
but where Hadoop runs "MapReduce jobs", Storm runs "topologies". The two are very different:
a MapReduce job eventually finishes, while a topology processes messages forever (or until you
kill it). Storm can run on top of YARN.
Spark: a parallel computing framework that can operate over any Hadoop input source: HDFS,
HBase, Amazon S3, Avro, etc. It holds intermediate results in memory rather than writing them to
disk, which drastically reduces query return time. Similar to a Hadoop cluster, but supports more
than just MapReduce.
Tez: a framework, built atop Apache Hadoop YARN, that allows a complex directed acyclic graph
(DAG) of tasks for processing data. MapReduce is batch-oriented and unsuited for interactive query;
Tez lets Hive and Pig process interactive queries at petabyte scale, and also supports machine-learning
workloads.
12. 12
Apache Spark
What is Spark?
• Spark is an in-memory cluster computing engine that includes an HDFS-compatible
in-memory file system.
Hadoop MapReduce
• Batch processing at scale
• Storage: Hadoop HDFS
• Runs on Hadoop
VS
Spark
• Batch, interactive, graph, and real-time processing
• Storage: Hadoop HDFS, Amazon S3, Cassandra, …
• Runs on many platforms
• Fast in-memory processing, up to 100x faster than MapReduce (M/R)
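As a hedged illustration of the in-memory difference, here is a small PySpark sketch (the HDFS path and app name are placeholders). An equivalent MapReduce flow would re-read the input from disk for each of the two questions:

```python
# Minimal PySpark sketch: cache an RDD once, reuse it from memory.
# (The HDFS path and app name are placeholders.)
from pyspark import SparkContext

sc = SparkContext(appName="SparkVsMapReduce")
lines = sc.textFile("hdfs:///data/weblogs").cache()  # keep in memory

# Both actions below reuse the cached partitions instead of
# re-reading HDFS; repeated access is where the speedup comes from.
print("total lines:", lines.count())
print("error lines:", lines.filter(lambda l: "ERROR" in l).count())
sc.stop()
```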
13. 13
Apache Storm
What Is Storm?
• Storm is a cluster engine that executes applications performing real-time
analysis of streaming data in motion, enabling the Internet of Things for
data such as sensor readings, aircraft parts data, traffic analysis, etc.
Storm
• Real-time stream processing at scale
• Storage: none (data in motion)
• Runs on Hadoop or on its own cluster
• Fast in-memory processing
VS
Spark
• Batch, interactive, graph, and real-time processing
• Storage: Hadoop HDFS, Amazon S3, Cassandra, …
• Runs on many platforms
• Fast in-memory processing
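To illustrate the mini-batch model on the Spark side of this comparison, here is a small PySpark Streaming sketch (host and port are placeholders; a comparable Storm topology would normally be written in Java against Storm's own API):

```python
# Mini-batch stream processing with Spark Streaming: input is chopped
# into 5-second batches, each processed as a small RDD job.
# (Host and port are placeholders, e.g. fed by `nc -lk 9999`.)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MiniBatchWordCount")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts

ssc.start()
ssc.awaitTermination()  # runs until killed, like a Storm topology
```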
14. 14
Quick Reference – Big Data
Vendors: The Apache Hadoop ecosystem is a collection of many projects.
Because of the complexity, for-profit companies have packaged, added to,
and enhanced the stack, trying to differentiate themselves in the Hadoop
world. The main players are:
- Cloudera – CDH, Cloudera's Distribution including Apache Hadoop. Current
version is CDH 5.2 (includes YARN)
- Hortonworks – HDP, the Hortonworks Data Platform. Spun out of Yahoo in
2011. Current version is HDP 2.2 (YARN)
- MapR – M3 (Community), M5 (Enterprise), M7 (adds NoSQL). An Apache
Hadoop derivative that uses its own NFS-accessible file system instead of HDFS.
- Pivotal – GPHD, Greenplum Hadoop. Spun out of EMC in 2013. Current
version is Pivotal HD 2.0 (YARN)
15. 15
Quick Reference – NoSQL
NoSQL: A NoSQL database provides a mechanism for storage and
retrieval of data that is modeled by means other than the tabular relations
used in relational databases – document, graph, and columnar databases.
An excellent comparison of NoSQL databases by Kristof Kovacs:
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Includes a comparison of:
- Cassandra
- MongoDB
- Riak
- Couchbase
- … and many more
16. 16
Quick Reference – NoSQL
Document Storage: stores documents that encapsulate and encode data in some
standard format (including XML, YAML, and JSON, as well as binary forms like BSON,
PDF, and Microsoft Office documents). Different implementations offer different ways of
organizing and/or grouping documents.
Documents are addressed in the database via a unique key that represents that
document. The big feature is that the database offers an API or query language for
retrieving documents based on their contents (see the MongoDB sketch below).
CouchDB: an Apache database that focuses on embracing the web. Uses JSON to store
data, JavaScript (via MapReduce) as its query language, and HTTP for an API. The
HTTP API is a differentiator between CouchDB and Couchbase.
Couchbase: designed to provide key-value or document access, with native JSON support.
Membase + CouchDB = Couchbase. The Couchbase architecture adds auto-sharding,
memcached-based caching, and 100% uptime redundancy over CouchDB alone. Couchbase
has a free version but is not open source.
MongoDB: stores data as JSON/BSON-style documents with flexible schemas. A
“collection” is a grouping of MongoDB documents; collections do not enforce document
structure (see the sketch below).
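A toy PyMongo sketch of the points above (driver choice, database, and field names are all illustrative): differently shaped documents share one collection, and queries address documents by their contents:

```python
# Toy PyMongo sketch (driver, database, and field names are all
# illustrative): differently shaped documents share one collection,
# and queries address documents by their contents.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client.demo.customers  # a "collection" of documents

# Collections do not enforce a document structure, so these two
# documents with different shapes can live side by side.
coll.insert_one({"name": "Acme", "tier": "gold"})
coll.insert_one({"name": "Initech", "contacts": [{"email": "a@b.c"}]})

# Retrieve a document by its contents.
print(coll.find_one({"name": "Acme"}))
```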
17. 17
Quick Reference – NoSQL
Column Storage: stores data tables as sections of columns of data rather than as
rows of data. Good for finding or aggregating over large sets of similar data. Column
storage serializes all data for one column contiguously on disk (so reading a column is
very fast). How you organize your data REALLY matters in columnar storage. There is
no restriction on the number of columns, and one row in a relational store may become
many rows in a columnar one.
Cassandra: Apache distributed database designed to handle large amounts of data
across many commodity servers, providing high availability with no single point of
failure.
Dynamo: Amazon's NoSQL database service. All data is stored on solid-state drives and
replicated across three availability zones. Integrated with Amazon EMR and S3. Stores “Items”
(collections of key-value pairs) identified by an ID.
Riak: a distributed, fault-tolerant key-value database with an HTTP/REST API. Can walk links (similar
to a graph). Best used for single-site scalability, availability, and fault tolerance – places where even
seconds of downtime hurt data collection. Good for point-of-sale or factory control system data
collection.
HBase: a non-relational data store on top of Hadoop. Think of the column as the key and the data as
the value. Column families must be created when the table is created (look on the Advanced tab to
create families, then use them when writing data); see the sketch below.
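Here is a small sketch of the column-family rule using the happybase Python client (an assumed library; host, table, and qualifiers are invented). The family is fixed at table creation, but qualifiers within it are created freely on write:

```python
# Column-family sketch using the happybase client (an assumed library;
# host, table, and qualifiers are invented). The family is fixed at
# table creation; qualifiers within it are created freely on write.
import happybase

conn = happybase.Connection("hbase-master.example.com")
conn.create_table("metrics", {"cf": dict()})  # declare family up front

table = conn.table("metrics")
# Row key -> {b"family:qualifier": value}; qualifiers are ad hoc.
table.put(b"sensor-42", {b"cf:temp": b"21.5", b"cf:unit": b"C"})
row = table.row(b"sensor-42")
print(row[b"cf:temp"])
```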
19. 19
Data-Driven Landscape
• Hadoop & NoSQL
• Data Quality
• Latency & Velocity
• Expanding Data Volumes
• Master Data Consistency
• Lack of Talent / Skills
• Siloed Data due to SaaS
• No End-to-End Metadata Visibility
20. 20
Macro Trends Revolutionizing
the Integration Market
• The amount of data will grow 50X from 2010 to 2020
• 64% of enterprises surveyed indicate that they're deploying or planning Big Data projects
• By 2020, 55% of CIOs will source all their critical apps in the Cloud
Source: Gartner and Cisco reports
21. 21
The New Data Integration Economics
“Big data is what
happened when the cost
of keeping information
became less than the
cost of throwing it away.”
– Technology Historian George Dyson
• 45x savings: $1,000/TB for Hadoop vs. $45,000/TB for traditional infrastructure
• $600B revenue shift by 2020 to companies that use big data effectively
• 6x faster ROI using big data analytics tools vs. a traditional EDW
• 600x active data: Neustar moved from storing 1% of data for 60 days to 100% for one year
22. 22
Existing Infrastructures Under Distress:
Architecturally and Economically
[Diagram: a traditional data warehouse architecture under stress. Relational systems/ERP, legacy systems, weblogs, and external data sources feed transform and metadata layers into data marts (the data warehouse), which serve standard reports, ad-hoc query tools, data mining, MDD/OLAP, and analytical applications. Stress points: the data explosion, the shift from batch to real-time, and the need for more active data.]
23. 23
Benefits of Hadoop and NoSQL
[Diagram: the same architecture with Hadoop and NoSQL added. NoSQL stores, ERP, DBMS/EDW, IoT, and web logs now land in Hadoop alongside legacy systems and the data marts (the data warehouse), which continue to serve standard reports, ad-hoc query tools, data mining, MDD/OLAP, and analytical applications. Hadoop and NoSQL absorb the data explosion, enable the move from batch to real-time, and keep data active longer.]
24. 24
Top Big Data Challenges
Source: Gartner - Survey Analysis: Big Data Adoption in 2013 Shows Substance Behind
the Hype - 12 September 2013 - G00255160
“How To”
Challenges
26. 26
Top Big Data Challenges
Need Solutions that
Address these
Challenges
Source: Gartner - Survey Analysis: Big Data Adoption in 2013 Shows Substance
Behind the Hype - 12 September 2013 - G00255160
27. 27
Convergence, Big Data & Consumerization
• Next-gen integration platforms need to be designed
& architected with big data requirements in mind
[Diagram: next-gen integration spans RDBMS and Hadoop with ETL/ELT and parallelization. Processing needs to be distributed & flexible, and big data technologies need to be integrated seamlessly with existing integration investments.]
29. 29
“I may say that this is the greatest
factor: the way in which the
expedition is equipped.”
– Roald Amundsen, race to the South Pole, 1911
Source of Roald Amundsen portrait: Norwegian National Library
30. 30
Big Data Integration: Ingest – Transform – Deliver
[Diagram: a big data integration pipeline. INGEST via Sqoop, Flume, the HDFS API, and the HBase API; TRANSFORM (data refinement) with map, profile, parse, cleanse, CDC, standardize, machine learning, and match steps, plus Hive, running on YARN (cluster resource management) and HDFS2 (redundant, reliable storage) alongside the usual engines (MapReduce batch, Tez interactive, Storm/Spark streaming, Giraph graph, MongoDB NoSQL, Falcon events, HBase online, search); DELIVER as an API via Karaf, ActiveMQ, CXF, Camel, Storm, and Kafka. Cross-cutting concerns: iPaaS, MDM, HA, governance, security, and metadata.]
31. 31
Big Data Integration and Processing
[Diagram: data from various source systems is loaded to HDFS, processed in Hadoop with MapReduce and Hive (big data integration), and federated to an analytics dashboard.]
32. 32
Important Objectives
• Moving from hand coding to code generation – MapReduce,
Pig, Hive, Sqoop, etc. – using a graphical user interface
• Zero footprint on the Hadoop cluster
• Same graphical user interface for both standard data
integration and Big Data integration
34. 34
“Pure Hadoop” and MapReduce
Visually design MapReduce jobs and optimize them before
deploying on Hadoop
35. 35
Native Map/Reduce Jobs
• Create graphical ETL patterns using native Map/Reduce
• Reduce the need for big
data coding skills
• Zero pre-installation on
the Hadoop cluster
• Hadoop is the “engine”
for data processing
36. 36
Other Important Objectives
Enables organizations to leverage existing skills such as
Java and other open-source languages
A large collaborative community for support
A large number of components for data and applications, including big data
and NoSQL
Works directly on the Apache Hadoop API
Native support for YARN and Hadoop 2.0 for better resource
optimization
Software created through open standards and development
processes, eliminating vendor lock-in
Scalability, portability, and performance come for “free” thanks to Hadoop
37. 37
Talend Solution for Big Data Integration
39. 39
The Value of Talend for Big Data
Leverage In-house Resources
- Easy-to-use, familiar Eclipse-based tools that generate big data code
- 100% standards-based, open source
- Lots of examples with a large collaborative community
Big Data Ready
- Native support for Hadoop, MapReduce, and NoSQL
- 800+ connectors to all data sources
- Built-in data quality, security and governance (Platform for Big Data)
Lower Costs
- A predictable and scalable subscription model
- Based only on users (not CPUs or connectors)
- Free to download, no runtimes to install on your cluster
40. 40
Talend’s Value for Big Data
• New frameworks like Spark and Storm are emerging on
Hadoop and can run on other platforms
• Companies want to accelerate big data processing and run
more sophisticated workloads by exploiting in-memory
capabilities via Spark and by analyzing real-time data in
motion via Storm
• Talend can generate Storm applications to analyze and
filter data in real-time as well as use source data filtered
by Storm applications
• Talend can help customers rapidly exploit new Big Data
technologies to reduce time to value, while insulating them
from future changes and advances in the underlying platforms