© 2015 MapR Technologies
Big Data Journey with Hadoop & MapR
Tug Grall
tug@mapr.com
@tgrall
David Pilato
david@elastic.co
@dadoonet
WHY?
https://www.domo.com/
Building new applications
Can I use my existing tools?
(Big) Data Platform
(Big) Data Project
Ingest
Store
Process
Consume
Ingest Data
Copy files in HDFS
hadoop fs -put dailylogs-log.zip /logs/2015/09/10/
Import RDBMS data
sqoop import --connect jdbc:mysql://db.foo.com/somedb --table customers \
  --target-dir /incremental_dataset --append
Targets: Files, HBase, Hive
Import RDBMS data (Logstash JDBC input)
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "postgres"
    jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT * from contacts"
  }
}
What’s “wrong”?
Batch????
Streaming
Flume, Kafka, Logstash to the rescue
Sources: Log, App Events, Twitter, Sensors, …
Destinations: HDFS, MapR-FS, Alerts, Elasticsearch, …, DB
Producers: Log, App Events, Twitter, Sensors, …
Broker
Consumers: HDFS, MapR-FS, Alerts, Elasticsearch, …, DB
Stream data into Hadoop using Flume
Servers (×4) → Flume → Files, HBase, Hive
Streams using Kafka
Producers (×3) → Kafka → Consumers (×3)
Consumer outputs: Files, HBase, Hive, Alert
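To make the producer / broker / consumer roles concrete, here is a minimal in-memory sketch. It is not the Kafka client API: `ToyBroker`, `ToyConsumer`, the topic name and the messages are all hypothetical, chosen only to show an append-only log that several consumers read at their own offsets.

```python
class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self.topics = {}

    def publish(self, topic, message):
        # Producers only ever append to the end of the log.
        self.topics.setdefault(topic, []).append(message)

    def read(self, topic, offset):
        # Return every message at or after the given offset.
        return self.topics.get(topic, [])[offset:]


class ToyConsumer:
    """Tracks its own offset, so each consumer reads at its own pace."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self):
        batch = self.broker.read(self.topic, self.offset)
        self.offset += len(batch)
        return batch


broker = ToyBroker()
archiver = ToyConsumer(broker, "logs")  # e.g. writes everything to files
alerter = ToyConsumer(broker, "logs")   # e.g. raises alerts on errors

broker.publish("logs", "INFO started")
broker.publish("logs", "ERROR disk full")

archived = archiver.poll()
alerts = [m for m in alerter.poll() if m.startswith("ERROR")]
```

Both consumers see the same messages independently, which is why one stream can feed storage and alerting at the same time.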
Stream data using Logstash
Data Storage
Data Format
How to store your data?
• Files in a distributed file system
• Rows in a NoSQL table
• Index in a search engine
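As a rough illustration of the three options above, the sketch below holds the same records in three shapes using plain Python structures. The movie records are invented for the example, and the dictionaries only stand in for a real file system, NoSQL table and search engine.

```python
import json
from collections import defaultdict

# Hypothetical records, invented for illustration.
movies = [
    {"id": "m1", "title": "The Big Data Journey"},
    {"id": "m2", "title": "The Small Data Story"},
]

# 1. File: one JSON document per line, read back by scanning.
file_lines = [json.dumps(m) for m in movies]

# 2. NoSQL table: each row is addressed directly by its row key.
table = {m["id"]: m for m in movies}

# 3. Search index: an inverted index from term to matching row keys.
index = defaultdict(set)
for m in movies:
    for term in m["title"].lower().split():
        index[term].add(m["id"])

row = table["m2"]           # key lookup
matches = index["journey"]  # full-text style lookup
```

The file favours bulk scans, the table favours lookups by key, and the index favours finding records by their content.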
Process Data
Data Processing
• Transform the data
• Enrich the data
• Examples:
– Store data in multiple formats
– Aggregate data
– Build recommendations
– …
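A small sketch of transform, enrich and aggregate on a handful of records. The event lines, the `users` lookup table and the field names are hypothetical. A real job would run the same steps on a cluster engine rather than in a Python list.

```python
from collections import Counter

# Hypothetical raw events ("date,user_id,path") and lookup table.
raw_events = [
    "2015-09-10,42,/movies/1",
    "2015-09-10,7,/movies/1",
    "2015-09-10,42,/movies/3",
]
users = {42: "FR", 7: "US"}


def transform(line):
    """Transform: parse a raw CSV line into a structured record."""
    date, user_id, path = line.split(",")
    return {"date": date, "user_id": int(user_id), "path": path}


def enrich(event):
    """Enrich: add the user's country from the lookup table."""
    return {**event, "country": users.get(event["user_id"], "unknown")}


events = [enrich(transform(line)) for line in raw_events]

# Aggregate: count views per path.
views_per_path = Counter(event["path"] for event in events)
```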
MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Use a higher-level language or DSL that does this for you
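The model above can be sketched in a single process as a word count. This is plain Python, not the Hadoop API: the function names and the sample log lines are assumptions, and `shuffle` is written out only to show what the framework does automatically.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def run_job(lines):
    mapped = chain.from_iterable(mapper(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical input lines.
logs = ["GET /index.html", "GET /about.html", "POST /login"]
counts = run_job(logs)
```

Chaining jobs means feeding the output of one `run_job` into the mappers of the next, which is the part a higher-level language or DSL handles for you.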
Apache Spark: Fast Big Data
– Rich APIs in Java, Scala, Python
– Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage
Spark: Unified Platform
Libraries: Spark SQL, Spark Streaming (Streaming), MLlib (Machine learning), GraphX (Graph computation)
Engine: Spark (General execution engine)
Cluster managers: Mesos, Hadoop YARN
Storage: Distributed File System (HDFS, MapR-FS, S3, …)
Elasticsearch / Watcher
Query the data
Files, HBase, Hive, Index
Discovery/Analytics
SQL strikes back!
Files, HBase, Hive
SQL on Hadoop
• SQL Shell
• JDBC / ODBC
• BI Tools
• Reporting
Elasticsearch
Kibana as a frontend
Example: Recommendation Platform
Machine Learning
MapR Cluster
HBase / MapR DB
MapR-FS
Add recommendations to movies
Capture Ratings
Movies & Recommendations
Movie Database
Conclusion
• If possible, use streams: Kafka, Logstash
• Advanced data processing and machine learning: Spark
• Expose your data using SQL for your “BI folks”: Drill
• Aggregation and full-text search: Elasticsearch
• Data visualisation: Kibana