The Future of Hadoop: Security and Real-Time Analytics

The Hadoop Path
A short presentation on where Hadoop is going
By
Subash DSouza

Hadoop and Google
• Hadoop came out of seminal papers released by Google in the early-to-mid 2000s, viz. GFS, MapReduce and Bigtable.
• To see where Hadoop is moving is to see where Google has gone.
• Great keynote talk by M.C. Srivas of MapR next week that addresses this question.

Jonathan Hsieh – Keynote talk at Big Data Camp LA 2014

Where do I think Hadoop is moving?
• Security
• Real Time Analytics

Security
• Hadoop vendors have become serious about security in the past year:
  • Hortonworks’s acquisition of XA Secure
  • Cloudera’s acquisition of Gazzang
• Kerberos has been the basis for authentication for quite some time, but capabilities like audit control and MDM have only been on the horizon.
• With these acquisitions, Hadoop vendors have been positioning themselves for a better security play.
• Cloudera has Apache Sentry; Hortonworks has Apache Knox.
• MapR supports security through authentication and authorization.

Real Time Analytics
• Real Time Streaming
  • Quickly ingest data as it comes in.
• Real Time Reporting
  • Quickly process the ingested data.

Real Time Streaming
• Storm
• Spark Streaming
• Samza

Apache Storm
• One of the first streaming tools built.
• Very low latency, typically 10-200 ms.
• Started by Nathan Marz at BackType, which was later acquired by Twitter.
• Strong support from Hortonworks.
• Lower-level APIs than Spark.
• Trident is Storm's micro-batching API, which closely resembles Spark Streaming.

Spark Streaming
• Based on the premise that not all data is required instantaneously.
• Uses a micro-batch method.
• Latency is approx. 1 sec.
• Streaming has single points of failure.
• Has scale issues.
• Good for machine learning.
• Strong support from Databricks, Cloudera, Hortonworks, MapR, DataStax & Pivotal.
• Integrates easily with the rest of Spark.
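
The micro-batch idea can be illustrated with a minimal sketch in plain Python. This is not the Spark Streaming API; the `micro_batches` function, the `(timestamp, payload)` event shape and the 1-second interval are assumptions made for illustration. Events are grouped into fixed-width time windows, and each window is handed off as one small batch, which is why latency is roughly bounded below by the batch interval.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (arrival time in seconds, payload).
Event = Tuple[float, str]


def micro_batches(events: Iterable[Event], interval: float = 1.0) -> Iterator[List[str]]:
    """Group time-ordered events into consecutive fixed-width batches.

    Windows start at t=0. Intervals with no events yield an empty batch.
    """
    batch: List[str] = []
    window_end = interval
    for ts, payload in events:
        # Close every window that ended before this event arrived.
        while ts >= window_end:
            yield batch
            batch = []
            window_end += interval
        batch.append(payload)
    if batch:
        yield batch


events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (3.5, "d")]
batches = list(micro_batches(events, interval=1.0))
print(batches)  # [['a', 'b'], ['c'], [], ['d']]
```

Event "a" arrives at 0.1 s but is only released when its window closes at 1.0 s, so a shorter interval lowers latency at the cost of more, smaller batches.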

Apache Samza (Incubator)
• Stream processing API built atop Kafka and YARN.
• Support from LinkedIn.
• Very similar to Storm.
• Currently offers only one level of delivery guarantee vs. multiple levels of guarantee in Storm.
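
To make "level of guarantee" concrete, here is a minimal plain-Python sketch of at-least-once delivery, assuming that is the single guarantee referred to above. It does not use the Samza or Kafka APIs; `process_with_checkpoints`, `checkpoint_every` and `crash_at` are hypothetical names. A consumer checkpoints its offset periodically, and after a crash it resumes from the last checkpoint, so nothing is lost but some messages are processed twice.

```python
def process_with_checkpoints(log, checkpoint_every, crash_at=None):
    """Consume `log` in order, checkpointing the offset every N messages.

    If `crash_at` is given, simulate one crash just before that offset is
    processed and restart from the last checkpointed offset.
    """
    delivered = []
    checkpoint = 0   # last committed offset
    offset = 0       # next offset to read
    crashed = False
    while offset < len(log):
        if crash_at is not None and not crashed and offset == crash_at:
            crashed = True
            offset = checkpoint  # replay everything since the last checkpoint
            continue
        delivered.append(log[offset])
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = offset
    return delivered


log = ["m0", "m1", "m2", "m3", "m4"]
delivered = process_with_checkpoints(log, checkpoint_every=2, crash_at=3)
print(delivered)  # ['m0', 'm1', 'm2', 'm2', 'm3', 'm4']
```

Every message appears at least once, and "m2" appears twice because it was processed after the last checkpoint and replayed after the crash. Downstream logic therefore needs to tolerate duplicates.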

Real Time Reporting (or near real time)
• Hive on Tez (Stinger)
• Impala
• Drill
• Spark
• HAWQ

Apache Hive on Apache Tez
• Tez is a new application framework built atop YARN.
• Workflows are compiled to DAGs on Tez.
• Runs jobs up to 5 times faster than standard MapReduce.
• Supports in-memory jobs for small datasets.
• Supported by Hortonworks & MapR.
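
A rough sketch of what "compiled to a DAG" means, using only the Python standard library (`graphlib`, Python 3.9+). It is not Tez code: the stage names and the `execution_waves` helper are hypothetical. A query becomes a graph of stages with dependencies, and stages whose inputs are ready can run together instead of being forced into a rigid chain of separate map and reduce jobs.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each stage maps to the set of stages it depends on.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}


def execution_waves(dag):
    """Return lists of stages that can run in parallel, in dependency order."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages with all inputs finished
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(plan)
print(waves)
# [['scan_customers', 'scan_orders'], ['join'], ['aggregate'], ['sort']]
```

Both scans run in the first wave, and each later stage starts as soon as its inputs finish.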

Cloudera Impala
• Massively parallel processing (MPP) architecture for performance, with Hadoop scalability.
• Performs interactive analysis on any data stored in HDFS and HBase.
• Built with native Hadoop security: integrated with Kerberos for authentication and Apache Sentry for fine-grained, role-based authorization.
• ANSI-92 SQL support.
• Supports common Hadoop file formats: text (optionally LZO-compressed), SequenceFile, Avro, RCFile and Parquet.
• Supported by Cloudera & MapR.

Apache Drill (Incubator)
• Drill is a clustered, powerful MPP (Massively Parallel Processing) query engine for Hadoop that can process petabytes of data, fast.
• Useful for short, interactive ad-hoc queries on large-scale data sets.
• Capable of querying nested data in formats like JSON and Parquet and performing dynamic schema discovery.
• Does not require a centralized metadata repository.
• Apache Drill provides direct queries on self-describing and semi-structured data in files (such as JSON, Parquet) and HBase tables.
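
Dynamic schema discovery on self-describing data can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Drill is implemented; `discover_schema` and the sample records are hypothetical. The schema is derived from the records as they are read, so no central metadata repository has to be populated first.

```python
import json


def discover_schema(record, prefix=""):
    """Flatten one nested JSON object into {dotted.path: type name}."""
    schema = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            schema.update(discover_schema(value, path))
        else:
            schema[path] = type(value).__name__
    return schema


# Hypothetical newline-delimited JSON; the second record adds a new field.
lines = [
    '{"id": 1, "user": {"name": "ann", "age": 30}}',
    '{"id": 2, "user": {"name": "bob"}, "tags": ["x"]}',
]

merged = {}
for line in lines:
    merged.update(discover_schema(json.loads(line)))

print(merged)
# {'id': 'int', 'user.name': 'str', 'user.age': 'int', 'tags': 'list'}
```

The `tags` column only appears once the second record is read, which is the kind of schema change a discover-on-read engine has to handle.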

Apache Spark
• Consists of multiple components: Spark Streaming, Spark SQL, MLlib and GraphX.
• Runs atop YARN, Mesos & EC2.
• Uses the concept of RDDs (Resilient Distributed Datasets), which are immutable: transformations produce new RDDs rather than modifying data in place.
• Enables in-memory processing when needed.
• Supported by Databricks, Cloudera, MapR, Hortonworks, DataStax & Pivotal.
• Strong support not just from the Hadoop community but also from data science: Mahout is moving to Spark, and so is Cloudera Oryx.
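
The RDD idea of immutable data plus transformations can be mimicked with a toy class in plain Python. `MiniRDD` is a hypothetical stand-in, not the Spark API: each transformation returns a new object that records what to do, the original is never modified, and nothing is computed until `collect()` is called.

```python
class MiniRDD:
    """Toy stand-in for an RDD: immutable source data plus recorded transforms."""

    def __init__(self, source, transforms=()):
        self._source = tuple(source)          # never mutated
        self._transforms = tuple(transforms)  # the recorded "lineage"

    def map(self, fn):
        # Return a new MiniRDD; this one is left untouched.
        return MiniRDD(self._source, self._transforms + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._transforms + (("filter", pred),))

    def collect(self):
        # Apply the recorded transforms only now, when a result is requested.
        data = iter(self._source)
        for kind, fn in self._transforms:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


numbers = MiniRDD(range(1, 6))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())  # [4, 16]
print(numbers.collect())       # [1, 2, 3, 4, 5]
```

Because each object keeps the full chain of transforms from the source, a lost result can be recomputed from that chain, which is the intuition behind the "resilient" in the name.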

Pivotal HAWQ
• Part of the Pivotal platform.
• Full SQL syntax support.
• Interoperability with Hive and HBase through the Pivotal Xtension Framework (PXF).
• Interoperability with Pivotal’s GemFire XD, their in-memory real-time database backed by HDFS.
• Proprietary to the Pivotal platform.

What to use where?
• Dependent on use cases.
• Use the right tool for the job.
• Sometimes several tools fit the same job, especially in the Hadoop ecosystem.
• In such scenarios, use what is easiest and most scalable for the enterprise.

Q&A
• @sawjd22
• subashdsouza@gmail.com
