Hadoop is moving towards improved security and real-time analytics. For security, Hadoop vendors have made acquisitions and implemented features like Kerberos authentication and Apache Sentry authorization. For real-time analytics, tools are focusing on real-time streaming (like Storm, Spark Streaming, and Samza) and real-time querying of data (like Hive on Tez, Impala, Drill, and Spark). The right tool depends on use cases, and enterprises should choose what is easiest and most scalable.
The Future of Hadoop: Security and Real-Time Analytics
1. The Hadoop Path
A short presentation on where Hadoop is going
By Subash DSouza
2. Hadoop and Google
S Hadoop came out of seminal papers released by Google in the early to mid 2000s, viz. GFS, MapReduce, and Bigtable.
S To see where Hadoop is moving is to see where Google
has gone.
S Great keynote talk by M.C. Srivas of MapR next week
that addresses this question.
4. Where do I think Hadoop is moving?
S Security
S Real Time Analytics
5. Security
S Hadoop vendors have become serious about security in the past
year
S Hortonworks’s acquisition of XA Secure
S Cloudera’s acquisition of Gazzang
S Kerberos has been the basis for authentication for quite some time, while things like audit control and MDM have been on the horizon.
S With these acquisitions, Hadoop vendors have been positioning
themselves for a better security play.
S Cloudera has Apache Sentry, Hortonworks has Apache Knox.
S MapR supports security through authentication and authorization
6. Real Time Analytics
S Real Time Streaming
S Quickly ingest data as it comes in.
S Real Time Reporting
S Quickly process the ingested data.
8. Apache Storm
S One of the first streaming tools built.
S Very low latency, typically 10-200 ms.
S Started by Nathan Marz at BackType, which was later acquired by Twitter.
S Strong support from Hortonworks.
S Lower-level APIs than Spark.
S Trident is Storm's micro-batching API, which closely resembles Spark Streaming.
9. Spark Streaming
S Based on the premise that not all data is needed instantaneously.
S Uses a micro-batch method.
S Latency is approx. 1 sec.
S Streaming has single points of failure.
S Has scale issues.
S Good for machine learning.
S Strong support from Databricks, Cloudera, Hortonworks, MapR,
Datastax & Pivotal.
S Easier to integrate with Spark.
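
The micro-batch idea on this slide can be sketched in plain Python. This is only an illustration, not the Spark Streaming API: the `micro_batches` function, the `(timestamp, value)` event shape, and the sample events are all assumptions made for the example. Events are grouped into fixed-width time windows, and each window would then be processed as one small batch job.

```python
from collections import defaultdict


def micro_batches(events, batch_interval=1.0):
    """Group (timestamp, value) events into consecutive fixed-width batches.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event falling in the same interval lands in the same batch.
        batches[int(timestamp // batch_interval)].append(value)
    # Return the non-empty batches in time order.
    return [batches[index] for index in sorted(batches)]


# Made-up sample events: (seconds since start, payload).
events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]

for batch in micro_batches(events, batch_interval=1.0):
    print(len(batch), batch)
```

A larger `batch_interval` means fewer, bigger batches and higher latency, which is the trade-off behind the roughly one-second latency quoted above.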
10. Apache Samza (Incubator)
S Stream processing API built atop Kafka and YARN.
S Support from LinkedIn.
S Very similar to Storm.
S Currently offers only one level of delivery guarantee vs. multiple levels in Storm.
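
To make "levels of guarantee" concrete, here is a small simulation, assuming the slide refers to message delivery semantics. It is plain Python, not Samza or Storm code; the `consume` function and its parameters are hypothetical. The consumer crashes once: committing the offset before processing loses a message (at-most-once), while committing after processing replays one (at-least-once).

```python
def consume(messages, commit_before_processing, crash_at):
    """Replay a consumer that crashes once while handling index `crash_at`.

    Hypothetical simulation; returns the messages that were processed.
    """
    processed = []
    committed = 0      # offset the consumer restarts from after a crash
    crashed = False
    offset = 0
    while offset < len(messages):
        if commit_before_processing:
            committed = offset + 1          # at-most-once: commit first
        if offset == crash_at and not crashed:
            crashed = True
            if not commit_before_processing:
                # Work was done, but the crash hit before the commit.
                processed.append(messages[offset])
            offset = committed              # restart from the last commit
            continue
        processed.append(messages[offset])
        committed = offset + 1              # at-least-once: commit after
        offset += 1
    return processed


msgs = ["m0", "m1", "m2"]
print(consume(msgs, commit_before_processing=True, crash_at=1))
print(consume(msgs, commit_before_processing=False, crash_at=1))
```

The first call drops `m1`; the second processes it twice, so downstream logic has to tolerate duplicates.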
11. Real Time Reporting (or near real time)
S Hive on Tez (Stinger)
S Impala
S Drill
S Spark
S Hawq
12. Apache Hive on Apache Tez
S Tez is a new application framework built atop YARN.
S Workflows are compiled to DAGs on Tez.
S Optimizes jobs to run up to 5 times faster than standard MapReduce.
S Supports in-memory jobs for small datasets.
S Supported by Hortonworks & MapR.
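
As a rough sketch of what "compiled to a DAG" means, the snippet below models a query as stages with dependencies and derives a valid execution order using Python's standard `graphlib` module (Python 3.9+). The stage names are invented for illustration, and this is not how Tez itself is programmed.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each stage maps to the stages it depends on.
stages = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}


def execution_order(dag):
    """Return one order in which every stage runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(stages)
print(order)
```

Because the whole plan is one graph, intermediate results can flow from stage to stage instead of being written out between separate MapReduce jobs.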
13. Cloudera Impala
S Massively parallel processing (MPP) architecture for
performance, with Hadoop scalability.
S Performs interactive analysis on any data stored in HDFS and HBase.
S Built with native Hadoop security: integrated with Kerberos for
authentication and Apache Sentry for fine-grained, role-based
authorization.
S ANSI-92 SQL support.
S Supports common Hadoop file formats: text, SequenceFiles,
Avro, RCFile, LZO and Parquet.
S Supported by Cloudera & MapR.
14. Apache Drill (Incubator)
S Drill is a clustered, powerful MPP (Massively Parallel
Processing) query engine for Hadoop that can process
petabytes of data, fast.
S Useful for short, interactive ad-hoc queries on large-scale data
sets.
S Capable of querying nested data in formats like JSON and
Parquet and performing dynamic schema discovery.
S Does not require a centralized metadata repository.
S Apache Drill provides direct queries on self-describing and
semi-structured data in files (such as JSON, Parquet) and
HBase tables.
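
The idea of dynamic schema discovery on self-describing data can be illustrated without Drill. The sketch below is an assumption-laden toy, not Drill's implementation: `discover_schema` and the sample records are hypothetical. It walks nested JSON records and collects the field paths and value types it encounters, with no schema declared up front.

```python
import json


def discover_schema(records):
    """Map each dotted field path to the set of value type names seen."""
    schema = {}

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into nested objects, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema


# Made-up records with differing fields, as in semi-structured data.
lines = [
    '{"user": {"id": 1, "name": "ann"}, "amount": 9.5}',
    '{"user": {"id": 2}, "amount": 3, "coupon": "X1"}',
]
schema = discover_schema(json.loads(line) for line in lines)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

Fields that appear in only some records (`coupon`, `user.name`) simply show up in the discovered schema once they are seen.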
15. Apache Spark
S Consists of multiple projects: Spark Streaming, Spark SQL, MLlib and GraphX.
S Runs atop YARN, Mesos & EC2.
S Uses the concept of RDDs (Resilient Distributed Datasets), where the data is immutable during transforms.
S Enables in-memory processing when needed.
S Supported by Databricks, Cloudera, MapR, Hortonworks,
Datastax & Pivotal.
S Strong support not just from the Hadoop community but also from data science: Mahout is moving to Spark, and so is Cloudera Oryx.
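
The immutability point about RDDs can be shown with a toy class. This is a plain-Python sketch, not the Spark API: `TinyDataset` is a hypothetical stand-in that holds its data in memory on one machine. Each transform returns a new dataset and leaves the source untouched.

```python
class TinyDataset:
    """Hypothetical, single-machine stand-in for an immutable dataset."""

    def __init__(self, items):
        self._items = tuple(items)   # a tuple cannot be modified in place

    def map(self, fn):
        # Build a new dataset; the original is not changed.
        return TinyDataset(fn(x) for x in self._items)

    def filter(self, predicate):
        return TinyDataset(x for x in self._items if predicate(x))

    def collect(self):
        return list(self._items)


base = TinyDataset([1, 2, 3, 4])
doubled = base.map(lambda x: x * 2)
large = doubled.filter(lambda x: x > 4)

print(base.collect(), doubled.collect(), large.collect())
```

Since no transform mutates its input, any derived dataset can be recomputed from its source, which is the property that makes the distributed datasets "resilient".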
16. Pivotal HAWQ
S Part of the Pivotal platform.
S Full SQL syntax support.
S Interoperability with Hive and HBase through the Pivotal
Xtension Framework (PXF).
S Interoperability with Pivotal’s GemFire XD, their in-memory
real-time database backed by HDFS.
S Proprietary to the Pivotal platform.
17. What to use where?
S Depends on the use case.
S Use the right tool for the job.
S Sometimes there are several tools for the same job, especially in the Hadoop ecosystem.
S In such scenarios, use what is easiest and most scalable for the enterprise.