Big Data Journey

© 2015 MapR Technologies
Big Data Journey with Hadoop & MapR
Tug Grall
tug@mapr.com
@tgrall

David Pilato
david@elastic.co
@dadoonet

(Big) Data Platform
(Big) Data Project

Copy files into HDFS
hadoop fs -put dailylogs-log.zip /logs/2015/09/10/
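
The date-partitioned target directory in the command above (/logs/2015/09/10/) can be generated instead of typed by hand. A minimal Python sketch, assuming a hypothetical helper named build_put_command and a /logs root; it only builds the argument list and does not call Hadoop itself:

```python
from datetime import date

def build_put_command(local_file, day, root="/logs"):
    """Return the argv for `hadoop fs -put` into a yyyy/mm/dd directory.

    `build_put_command` and `root` are illustrative names, not part of Hadoop.
    """
    target_dir = f"{root}/{day:%Y/%m/%d}/"
    return ["hadoop", "fs", "-put", local_file, target_dir]

cmd = build_put_command("dailylogs-log.zip", date(2015, 9, 10))
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run on a machine where the hadoop client is installed.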

Import RDBMS data
sqoop import --connect jdbc:mysql://db.foo.com/somedb --table customers --target-dir /incremental_dataset --append
[Diagram labels: Files, HBase, Hive]

Import RDBMS data with Logstash
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "postgres"
    jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT * from contacts"
  }
}

What’s “wrong”?
Batch????

Streaming
Flume, Kafka, Logstash to the rescue

[Diagram labels: Log, App Events, Twitter, Sensors, … / HDFS, MapR-FS, Alerts, Elasticsearch, … / DB]

[Diagram labels: Log, App Events, Twitter, Sensors, … / HDFS, MapR-FS, Alerts, Elasticsearch, … / DB, with a Broker between Producers and Consumers]
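
The idea of a broker sitting between producers and consumers can be sketched in a few lines of Python. This is a toy, in-memory illustration only: the Broker class, the topic name and the consumer names are all hypothetical, and it is not how Kafka, Flume or Logstash are implemented.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker: keeps an append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> list of messages
        self._offsets = defaultdict(int)  # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        # Producers only talk to the broker, never to a consumer.
        self._logs[topic].append(message)

    def poll(self, consumer, topic):
        """Return messages this consumer has not read yet and advance its offset."""
        log = self._logs[topic]
        start = self._offsets[(consumer, topic)]
        self._offsets[(consumer, topic)] = len(log)
        return log[start:]

broker = Broker()
broker.publish("events", {"source": "sensor", "value": 42})
broker.publish("events", {"source": "log", "value": "GET /index.html"})

# Two independent consumers each receive every message.
to_hdfs = broker.poll("hdfs-writer", "events")
to_alerts = broker.poll("alerting", "events")
```

Because each consumer tracks its own offset, a new destination can be added without touching any producer.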

Stream data into Hadoop using Flume
[Diagram labels: Server (×4), Files, HBase, Hive]

Streams using Kafka
[Diagram labels: Producer (×3), Consumer (×3), Files, HBase, Hive, Alert]

How to store your data?
• Files in a distributed file system
• Rows in a NoSQL table
• Index in a search engine
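
As a rough illustration of the three options above, the same record can be shaped as a line in a file, a row in a table, or entries in an inverted index. The record, the column names (info:title, info:year) and the variable names are invented for this sketch; plain Python structures stand in for the real storage systems.

```python
import json
from collections import defaultdict

# Hypothetical sample record.
movie = {"id": "m1", "title": "Big Data Journey", "year": 2015}

# 1. File: one JSON document per line, ready to append to a file.
file_line = json.dumps(movie, sort_keys=True)

# 2. NoSQL row: row key -> {column: value}, in the style of a wide-column table.
table = {movie["id"]: {"info:title": movie["title"], "info:year": str(movie["year"])}}

# 3. Search index: inverted index mapping each term to the ids that contain it.
index = defaultdict(set)
for term in movie["title"].lower().split():
    index[term].add(movie["id"])
```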

Data Processing
• Transform the data
• Enrich the data
• Examples:
– Store data in multiple formats
– Aggregate data
– Build Recommendations
– …
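
The transform, enrich and aggregate steps listed above can be sketched on a handful of records. The log lines, the lookup table and the function names below are hypothetical and exist only for this example; a real pipeline would run the same steps on a processing engine.

```python
from collections import Counter

# Hypothetical raw log lines and a lookup table used for enrichment.
RAW = ["2015-09-10 u1 /home", "2015-09-10 u2 /cart", "2015-09-10 u1 /cart"]
COUNTRY_BY_USER = {"u1": "FR", "u2": "DE"}

def transform(line):
    """Parse a raw line into a structured event."""
    day, user, path = line.split()
    return {"day": day, "user": user, "path": path}

def enrich(event):
    """Add a country field looked up from the user id."""
    return {**event, "country": COUNTRY_BY_USER.get(event["user"], "unknown")}

events = [enrich(transform(line)) for line in RAW]

# Aggregate: number of hits per country.
hits_by_country = Counter(event["country"] for event in events)
```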

MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Use a higher level language or DSL that does this for you
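
The model above (mappers, automatic shuffle, reducers) can be imitated in plain Python with a word count. This is a single-process sketch for illustration, not Hadoop code; the names mapper, reducer and run_job are assumptions made for this example.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Sum all counts collected for one word."""
    return word, sum(counts)

def run_job(lines):
    # Map phase: apply the mapper to every input line.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle phase: group values by key (done by the framework in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: apply the reducer to each key and its values.
    return dict(reducer(key, values) for key, values in groups.items())

result = run_job(["big data journey", "big data platform"])
```

Chaining jobs would mean feeding the output of one run_job call into another mapper, which is the bookkeeping that higher-level languages take care of.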

Apache Spark: Fast Big Data
• Easy to Develop
– Rich APIs in Java, Scala, Python
– Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage

Spark: Unified Platform
[Stack diagram, top to bottom]
Spark SQL | Spark Streaming (Streaming) | MLlib (Machine learning) | GraphX (Graph computation)
Spark (General execution engine)
Mesos | Hadoop YARN
Distributed File System (HDFS, MapR-FS, S3, …)

Discovery/Analytics
[Diagram labels: Files, HBase, Hive, Index]

SQL on Hadoop
[Diagram labels: Files, HBase, Hive]
• SQL Shell
• JDBC, ODBC
• BI Tools
• Reporting

Example: Recommendation Platform

[Diagram labels: Movie Database; Capture Ratings; MapR Cluster (HBase / MapR DB, MapR-FS); Machine Learning; Add recommendations to movies; Movies & Recommendations]
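
One simple way to "add recommendations to movies" from captured ratings is item co-occurrence: movies liked by the same users are recommended together. The sketch below is a generic illustration of that technique, not the method used in the presenters' platform; the ratings data and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical captured ratings: user -> set of movies the user liked.
LIKED = {
    "ann": {"Alien", "Blade Runner", "Brazil"},
    "bob": {"Alien", "Blade Runner"},
    "cat": {"Blade Runner", "Brazil"},
}

def cooccurrence(liked):
    """Count how often each pair of movies is liked by the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for movies in liked.values():
        for a, b in combinations(sorted(movies), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(movie, counts, top_n=2):
    """Return the movies most often liked together with `movie`."""
    related = counts[movie]
    # Sort by descending count, then by title to keep the order deterministic.
    return sorted(related, key=lambda m: (-related[m], m))[:top_n]

counts = cooccurrence(LIKED)
recs = recommend("Alien", counts)
```

The recommendations computed this way could then be stored next to each movie row, so the application reads movies and recommendations in one lookup.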

Conclusion
• If possible, use Streams: Kafka, Logstash
• Advanced Data Processing and Machine Learning: Spark
• Expose your data using SQL for your “BI folks”: Drill
• Aggregation and Full Text Search: Elasticsearch
• Data Visualisation: Kibana
