The document discusses big data and Hadoop. It notes that big data comes in terabytes and petabytes, sometimes generated daily. Hadoop is presented as a framework for distributed computing on large datasets using MapReduce. While Hadoop can store and process massive amounts of data across commodity servers, it was not designed for business intelligence requirements. The document proposes addressing this by adding data integration and transformation capabilities to Hadoop through tools like Pentaho Data Integration, to enable it to better meet the needs of big data analytics.
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage - Cloudera, Inc.
Learn about:
Why big data matters to your business: realize revenue, increase customer loyalty, and pinpoint effective strategies
The business and technical challenges of big data solutions
How to leverage big data for competitive advantage
The “must haves” of an effective big data solution
Real-world examples of Cloudera, Pentaho and Dell big data solutions in action
Putting Business Intelligence to Work on Hadoop Data Stores - DATAVERSITY
An inexpensive way of storing large volumes of data, Hadoop is also scalable and redundant. But getting data out of Hadoop is tough due to a lack of a built-in query language. Also, because users experience high latency (up to several minutes per query), Hadoop is not appropriate for ad hoc query, reporting, and business analysis with traditional tools.
The first step in overcoming Hadoop's constraints is connecting to Hive, a data warehouse infrastructure built on top of Hadoop, which provides the relational structure necessary for scheduled reporting on large datasets stored in Hadoop files. Hive also provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to query this data.
But to really unlock the power of Hadoop, you must be able to efficiently extract data stored across multiple (often tens or hundreds) of nodes with a user-friendly ETL (extract, transform and load) tool that will then allow you to move your Hadoop data into a relational data mart or warehouse where you can use BI tools for analysis.
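As a rough illustration of that extract-transform-load step (the file contents and table schema here are hypothetical, with SQLite standing in for the target data mart or warehouse), a minimal sketch might look like:

```python
import csv
import io
import sqlite3

# Hypothetical extract: a tab-delimited part file as a MapReduce job might emit it.
raw_part_file = io.StringIO("page_a\t120\npage_b\t45\npage_a\t30\n")

# Load into a relational "data mart" table (SQLite stands in for the warehouse).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_hits (page TEXT, hits INTEGER)")
for page, hits in csv.reader(raw_part_file, delimiter="\t"):
    conn.execute("INSERT INTO page_hits VALUES (?, ?)", (page, int(hits)))

# Once loaded, ordinary BI-style SQL applies.
rows = conn.execute(
    "SELECT page, SUM(hits) FROM page_hits GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('page_a', 150), ('page_b', 45)]
```

A real ETL tool would handle the multi-node extraction, type mapping and scheduling that this sketch glosses over.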
Hadoop World 2011: The Blind Men and the Elephant - Matthew Aslett - The 451 ... - Cloudera, Inc.
Who is contributing to the Hadoop ecosystem, what are they contributing, and why? Who are the vendors that are supplying Hadoop-related products and services and what do they want from Hadoop? How is the expanding ecosystem benefiting or damaging the Apache Hadoop project? What are the emerging alternatives to Hadoop and what chance do they have? In this session, the 451 Group will seek to answer these questions based on their latest research and present their perspective of where Hadoop fits in the total data management landscape.
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014) - Eric Baldeschwieler
A summary of the History of Hadoop, some observations about the current state of Hadoop for new users and some predictions about its future (Hint, it's gonna be huge).
Presented at:
http://www.meetup.com/Pasadena-Big-Data-Users-Group/events/203961192/
Hadoop as Data Refinery - Steve Loughran - JAX London
Apache Hadoop is often described as a "Big Data Platform" but what does that mean? One way to better understand Hadoop is to talk about how Hadoop is used. This talk discusses using Hadoop as a "Data Refinery", which is a common use case. The concept is very much like a traditional oil refinery except with data, pulling in large quantities of "crude data" over pipelines, refining some into useful business intelligence; refining other pieces into slightly less crude data that stays in the cluster until needed later. This metaphor proves useful when considering how Hadoop could be adopted in an organisation that already has data warehousing and business intelligence systems -and when contemplating how to hook up a Hadoop cluster to the sources of data inside and outside that organisation. A key point to remember is that storing data in Hadoop is not a means to an end any more than storing data in a database is: it is extracting information from that data. Using Hadoop as a front end "data refinery" means that it can integrate with existing Business Intelligence systems, while providing the platform for new applications.
Explores the notion of "Hadoop as a Data Refinery" within an organisation, whether or not it has an existing Business Intelligence system, and looks at 'agile data' as a benefit of using Hadoop as the store for historical, unstructured and very-large-scale datasets.
The final slides look at the challenge of an organisation becoming "data driven"
Slides from a presentation I gave at the 5th SOA, Cloud + Service Technology Symposium (September 2012, Imperial College, London). The goal of this presentation was to explore with the audience use cases at the intersection of SOA, Big Data and Fast Data. If you are working with both SOA and Big Data I would be very interested to hear about your projects.
Leveraging Hadoop with OBIEE 11g and ODI 11g - UKOUG Tech'13 - Mark Rittman
The latest releases of OBIEE and ODI come with the ability to connect to Hadoop data sources, using MapReduce to integrate data from clusters of "big data" servers complementing traditional BI data sources. In this presentation, we will look at how these two tools connect to Apache Hadoop and access "big data" sources, and share tips and tricks on making it all work smoothly.
Human Information is made up of ideas, is diverse, and has context.
Ideas don’t exactly match like data does; they have distance.
Human Information is not static – it’s dynamic and lives everywhere.
Details on applications
HAVEn is integrated into customers' architectures through other n Apps
HP has started modifying its existing application portfolio to use HAVEn
And HP is building new applications that leverage the power of HAVEn
Many customers are already building applications that use multiple HAVEn
I was meaning to put this talk up for grabs for some time now, but kept forgetting. I was invited to give the keynote speech for the Microstrategy World 2008 conference. The talk was very well received, so here it is.
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c - Mark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we’ll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We’ll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop’s schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a “data factory” to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.
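The schema-on-read approach mentioned above can be sketched in a few lines (the log format and field names are illustrative assumptions, not any specific Oracle or Hadoop API): raw lines land unmodified, and a schema is imposed only when the data is read.

```python
# Raw, low-density data lands as plain text lines. Schema-on-write would
# have rejected or coerced these up front; schema-on-read keeps them as-is.
raw_lines = [
    "2014-10-01 GET /index.html 200",
    "2014-10-01 GET /missing 404",
    "garbage line that does not parse",
]

def read_with_schema(lines):
    """Apply a (date, method, path, status) schema at read time,
    silently skipping lines that do not fit it."""
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[3].isdigit():
            yield {"date": parts[0], "method": parts[1],
                   "path": parts[2], "status": int(parts[3])}

records = list(read_with_schema(raw_lines))
errors = [r for r in records if r["status"] >= 400]
print(len(records), len(errors))  # 2 1
```

The "data factory" step would then load only the records that survive this parse into high-value relational storage.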
It is almost impossible to escape the topic of Data Science. While the core of Data Science has remained the same over the last decade, its emergence to the forefront is spurred by both the availability of new data types and a true realization of the value it delivers. In this session, we will provide an overview of data science and the different classes of machine learning algorithms, and deliver an end-to-end demonstration of machine learning using Hadoop. Audience: Developers, Data Scientists, Architects and System Engineers.
Recording: https://hortonworks.webex.com/hortonworks/lsr.php?RCID=4175a7421d00257f33df146f50c41af8
Presentation regarding big data. The presentation also covers the basics of Hadoop and Hadoop components along with their architecture. Contents of the PPT are:
1. Understanding Big Data
2. Understanding Hadoop & It’s Components
3. Components of Hadoop Ecosystem
4. Data Storage Component of Hadoop
5. Data Processing Component of Hadoop
6. Data Access Component of Hadoop
7. Data Management Component of Hadoop
8. Hadoop Security Management Tools: Knox, Ranger
Extending the Data Warehouse with Hadoop - Hadoop World 2011 - Jonathan Seidman
Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.
Hadoop has shown itself to be a great tool for resolving problems with different data aspects, such as data velocity, variety and volume, that cause trouble for relational database storage. In this presentation you'll learn what problems with data occur nowadays and how Hadoop can solve them. You'll learn about Hadoop's basic components and the principles that make Hadoop such a great tool.
InfoAxon has 10+ years of rich open source solutions delivery experience. InfoAxon has a well-defined Open Source BI Practice, a collection of frameworks, templates, tools, methodologies, services and pre-integrated solutions resulting in customized BI Solutions for specific verticals or specific business problems.
At InfoAxon, we have adopted and developed a deep understanding of the Pentaho BI platform, one of the world's leading Open Source BI Platforms. We have rich experience in delivering customized BI Solutions using Pentaho as the core BI engine.
Pentaho - Jake Cornelius - Hadoop World 2010 - Cloudera, Inc.
Putting Analytics in Big Data Analytics
Jake Cornelius
Director of Product Management, Pentaho Corporation
Learn more @ http://www.cloudera.com/hadoop/
Pentaho Big Data Analytics with Vertica and Hadoop - Mark Kromer
Overview of the Pentaho Big Data Analytics Suite from the Pentaho + Vertica presentation at Big Data Techcon 2014 in Boston for the session called "The Ultimate Selfie | Picture Yourself with the Fastest Analytics on Hadoop with HP Vertica and Pentaho"
BI congres 2014-5: from BI to big data - Jan Aertsen - Pentaho - BICC Thomas More
7th BI congress of BICC-Thomas More: 3 April 2014
A travel report from Business Intelligence to Big Data
The travel industry is changing rapidly. This presentation is a journey through classic and modern BI destinations, showing a series of snapshots of different use cases in the travel industry. During the session we highlight the capacity and flexibility a BI tool needs to guide you on your journey from classic BI implementations to modern big data challenges.
Strata 2015 presentation from Oracle for Big Data - we are announcing several new big data products including GoldenGate for Big Data, Big Data Discovery, Oracle Big Data SQL and Oracle NoSQL
Blending Hadoop and MongoDB with Pentaho [11:10 am - 11:30 am]
For eCommerce companies, knowing how promoted wish-lists can spark consumer spending is an analytics goldmine. In this lightning talk, Bo Borland will demonstrate how Pentaho analytics can blend click-stream data about promoted wish-lists with sales transaction records using Hadoop, MongoDB and Pentaho to reveal patterns in online shopping behavior. Regardless of your industry or specific use model, come to this session to learn how to blend MongoDB data with any data source for greater business insight. Pentaho offers the first end-to-end analytic solution for MongoDB. From data ingestion to pixel perfect reporting and ad hoc “slice and dice” analysis, the solution meets today’s growing demand for a 360-degree view of your business.
MongoDB IoT City Tour EINDHOVEN: Analysing the Internet of Things: Davy Nys, ... - MongoDB
Drawing on Pentaho's wide experience in solving customers' big data issues, Davy Nys will position the importance of analytics in the IoT:
[-] Understanding the challenges behind data integration & analytics for IoT
[-] Future proofing your information architecture for IoT
[-] Delivering IoT analytics, now and tomorrow
[-] Real customer examples of where Pentaho can help
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake... - NoSQLmatters
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
30 for 30: Quick Start Your Pentaho Evaluation - Pentaho
These slides are from our recent 30 for 30 webinar tailored towards people who have downloaded the Pentaho evaluation and want to know more about the data integration and business analytics components that are part of the trial, how to easily integrate data, and best practices for installing and developing content.
Apache Tajo: A Big Data Warehouse System on Hadoop
Presented by Jae-hwa Jeong, Apache Tajo committer and senior research engineer at Gruter, at Bigdata World Convention 2014, Oct. 23, Busan, Korea
5 things cucumber is bad at by Richard Lawrence - Skills Matter
This talk will look at 5 things Cucumber’s bad at, why that’s a good thing, and what it tells us about Cucumber’s sweet spot in a team’s toolkit.
Many times, when people complain about something Cucumber’s not good at, they’re unwittingly describing something Cucumber shouldn't be good at. They’re revealing that they don’t quite understand BDD and Cucumber’s role in it.
Cucumber is the world's most misunderstood collaboration tool and people need to hear this over and over again.
Patterns for slick database applications - Skills Matter
Slick is Typesafe's open source database access library for Scala. It features a collection-style API, compact syntax, type-safe, compositional queries and explicit execution control. Community feedback helped us to identify common problems developers are facing when writing Slick applications. This talk suggests particular solutions to these problems. We will be looking at reducing boiler-plate, re-using code between queries, efficiently modeling object references and more.
Scala eXchange 2013: Haoyi Li on Metascala, a tiny DIY JVM - Skills Matter
Metascala is a tiny metacircular Java Virtual Machine (JVM) written in the Scala programming language. Metascala is barely 3000 lines of Scala, and is complete enough that it is able to interpret itself metacircularly. Being written in Scala and compiled to Java bytecode, the Metascala JVM requires a host JVM in order to run.
The goal of Metascala is to create a platform to experiment with the JVM: a 3000 line JVM written in Scala is probably much more approachable than the 1,000,000 lines of C/C++ which make up HotSpot, the standard implementation, and more amenable to implementing fun features like continuations, isolates or value classes. The 3000 lines of code gives you:
The bytecode interpreter, together with all the run-time data structures
A stack-machine to SSA register-machine bytecode translator
A custom heap, complete with a stop-the-world, copying garbage collector
Implementations of parts of the JVM's native interface
Although it is far from a complete implementation, Metascala already provides the ability to run untrusted bytecode securely (albeit slowly), since every operation which could potentially cause harm (including memory allocations and CPU usage) is virtualized and can be controlled. Ongoing work includes tightening of the security guarantees, improving compatibility and increasing performance.
Progressive F# Tutorials NYC: Dmitry Mozorov & Jack Pappas on Code Quotations ... - Skills Matter
Code Quotations: Code-as-Data for F#
This tutorial will cover F# Code Quotations in-depth. You'll learn what Code Quotations are, how to use them, and where to apply them in your applications. We'll work through several real-world examples to highlight the important features -- and potential pitfalls -- of Code Quotations.
CukeUp NYC: Ian Dees on Elixir, Erlang, and Cucumberl - Skills Matter
Elixir, Erlang, and Cucumberl
Elixir is a new Ruby-inspired programming language that uses the powerful concurrent machinery of Erlang behind the scenes. Cucumberl is a port of Cucumber to Erlang. Let's see what happens when we put them together.
In this talk, we'll discuss:
How Erlang's concurrency makes it easier to write robust programs
Elixir's approachable syntax
How to test Erlang and Elixir programs using Cucumberl
Attendees will walk away with a solid introduction to the principles of Erlang, and an appreciation of the way Elixir brings the joy of Ruby to the solidity of the Erlang runtime.
CukeUp NYC: Peter Bell on Getting Started with cucumber.js - Skills Matter
Cukeup NYC. Peter Bell on Getting started with cucumber.js
Ever wished you could use cucumber in your javascript apps? In this talk we'll look at the current state of play of cucumber js, when you should and shouldn't use it, and how to get started writing your step definitions in javascript.
Agile Testing & BDD eXchange NYC 2013: Jeffrey Davidson & Lav Pathak & Sam Ho... - Skills Matter
In this engaging experience report, we will present 3 different views – Developer, Tester, Business Analyst – of implementing Acceptance Test Driven Development in a complex, data-driven domain. Hear how we used ATDD for building a ubiquitous language across the entire team, promoting faster feedback, and cultivating a culture where product owners were deeply invested in the quality of both every deliverable and the system as a whole.
Progressive F# Tutorials NYC: Rachel Reese & Phil Trelford on Try F# from Zero... - Skills Matter
In this tutorial, Phil and Rachel will introduce you to the Try F# samples giving you exposure to, and an understanding of, how F# tackles some real-world scenarios. We'll help you explore, generate, and just play around with code samples, as well as talk you through some of the key principles of F#. By the end of this session, you'll have gone from zero to data science in only a few hours!
Progressive F# Tutorials NYC: Don Syme keynote on F# in the Open Source World - Skills Matter
F# is a powerful open-source language which Microsoft, other companies and the F# community all contribute to. In this talk, Don will discuss how the “F# space” has recently opened up significantly in interesting ways. F# now includes contributions that range from Cloud IDE platforms, Cloud Compute frameworks, Data interoperability components, Cross-platform execution, Try F#, MonoDevelop, and even Emacs editor integration with surprising tooling support, as well as the Visual F# tools from Microsoft and the broader NuGet package ecosystem. Don will also talk about some of the latest contributions from Microsoft Research, including new type provider components for F#, and describe how his team work with the Visual F# team and other teams around Microsoft. There will also be demos of some fun new stuff that’s been going on with F# at MSR and the community.
Agile Testing & BDD eXchange NYC 2013: Gojko Adzic on Bond Villain Guide to S... - Skills Matter
Would you like to learn how to make your software testing practices more effective? And how to use your testing strategy to better capture and reflect customer requirements? Gojko Adzic takes a critical look at the effectiveness of current software testing practices and proposes strategies to make it much more effective.
Dmitry Mozorov on Code Quotations: Code-as-Data for F# - Skills Matter
Code Quotations: Code-as-Data for F#
This tutorial will cover F# Code Quotations in-depth. You'll learn what Code Quotations are, how to use them, and where to apply them in your applications. We'll work through several real-world examples to highlight the important features -- and potential pitfalls -- of Code Quotations.
Simon Peyton Jones: Managing parallelism - Skills Matter
If you want to program a parallel computer, it obviously makes sense to start with a computational paradigm in which parallelism is the default (ie functional programming), rather than one in which computation is based on sequential flow of control (the imperative paradigm). And yet, and yet ... functional programmers have been singing this tune since the 1980s, but do not yet rule the world. In this talk I’ll say why I think parallelism is too complex a beast to be slain at one blow, and how we are going to be driven, willy-nilly, towards a world in which side effects are much more tightly controlled than now. I’ll sketch a whole range of ways of writing parallel program in a functional paradigm (implicit parallelism, transactional memory, data parallelism, DSLs for GPUs, distributed processes, etc, etc), illustrating with examples from the rapidly moving Haskell community, and identifying some of the challenges we need to tackle.
2. Big Data
Terabytes and petabytes of data
Sometimes per day
2010, Pentaho. All Rights Reserved. www.pentaho.com. US and Worldwide: +1 (866) 660-7555
3. Example Use Cases Today
Transactional
•Fraud detection
•Financial services/stock markets
Sub-Transactional
•Weblogs
•Social/online media
•Telecoms events
4. Example Use Cases Today
Non-Transactional
•Web pages, blogs etc
•Documents
•Physical events
•Application events
•Machine events
In most cases structured or semi-structured
5. Data Lake
• Single source
• Large volume
• Not distilled
6. Data Lakes
• 0-2 lakes per company
• Known and unknown questions
• Multiple user communities
• $1-10k questions, not $1m ones
• Don’t fit in traditional RDBMS with a reasonable cost
7. Data Lake Requirements
• Store all the data
• Satisfy routine reporting and analysis
• Satisfy ad-hoc query / analysis / reporting
• Balance performance and cost
8. Traditional BI
[Diagram: data flows from source systems into data mart(s); the remainder goes to tape/trash]
9. What if...
[Diagram: data sources feed data lake(s), which feed data mart(s), ad-hoc analysis and a data warehouse]
10. Big Data Does Not Replace Data Marts
• It’s not a database
• High latency
• Optimized for massive data-crunching
• Its databases are immature
• Its databases are NoSQL, not relational
2010, Pentaho. All Rights Reserved. www.pentaho.com. US and Worldwide: +1 (866) 660-7555 | Slide
11. Big Data: Map/Reduce and Hadoop
12. What is Map/Reduce?
• Obligatory Wikipedia quote: “... is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers”
• Invented by Google to index “The Internet”
• Apache Hadoop is an open-source implementation of the Map/Reduce algorithm
• Scalable & fault-tolerant, not efficient!
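The two phases can be sketched in plain Java. This is a toy word count, not Hadoop code: the class and method names are illustrative, and the framework's shuffle step is replaced by a simple in-memory grouping.

```java
import java.util.*;
import java.util.stream.*;

// Toy word count that mimics the two Map/Reduce phases in plain Java.
// No Hadoop APIs are used; this only illustrates the shape of the computation.
public class WordCountSketch {

    // Map phase: each input line is turned into (word, 1) pairs independently,
    // so lines can be processed on any node without coordination.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: all pairs sharing a key are summed. In Hadoop the framework
    // shuffles pairs to reducers by key; here we just group in memory.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data", "big deal")));
        // prints {big=2, data=1, deal=1}
    }
}
```

Because each map call is independent and reduce only needs pairs grouped by key, both phases can be spread over many nodes, which is exactly the property that makes the model scalable and fault-tolerant rather than efficient.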
13. What Hadoop Really Is
• Core components
• HDFS – a distributed file system allowing massive storage across a cluster of commodity servers
• Map/Reduce – a framework for distributed computation; common use cases include aggregating, sorting, and filtering BIG data sets
• The problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
• Related projects
• Hive – a data warehouse infrastructure on top of Hadoop
• Implements a SQL-like query language, including a JDBC driver
• Allows MapReduce developers to plug in custom mappers and reducers
• HBase – the Hadoop database – AH HA!
• A variant of NoSQL database, problematic for traditional BI
• Best at storing large amounts of unstructured data
14. No seriously, what is Hadoop?
Java software framework that supports data-intensive distributed applications
• Apache project
• Created at Yahoo!, based on Google’s ideas
• Distributed filesystem + MapReduce engine
• Commodity hardware
• Scales out beyond the technology and/or economics of an RDBMS
15. Hadoop and BI?
• Distributed processing
• Distributed file system
• Commodity hardware
• Platform independent (in theory)
• Scales out beyond the technology and/or economics of an RDBMS
In many cases it’s the only viable solution
16. Hadoop and BI?
90% of new Hadoop use cases are transformations of semi-structured or structured data*
* of those companies we’ve talked to...
17. Hadoop and BI?
“The working conditions
within Hadoop are shocking”
ETL Developer
18. Hadoop and BI?
Instead of this...
19. Hadoop and BI?
You have to do this in Java...
public void map(
    Text key,
    Text value,
    OutputCollector output,
    Reporter reporter)

public void reduce(
    Text key,
    Iterator values,
    OutputCollector output,
    Reporter reporter)
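To make the ceremony concrete, here is roughly what filling in the map callback involves. The OutputCollector and Reporter interfaces below are simplified stand-ins (not the real Hadoop classic-API types) so the sketch compiles on its own; even this trivial word-count mapper needs the full plumbing.

```java
import java.util.*;

// Sketch of the boilerplate a Hadoop-style mapper demands. The nested
// interfaces are simplified stubs standing in for the classic Hadoop API,
// so this compiles without any Hadoop dependencies.
public class VerboseMapper {

    interface OutputCollector { void collect(String key, int value); }
    interface Reporter { /* progress callbacks omitted */ }

    // Emit a (word, 1) pair for every word in the input value.
    public void map(String key, String value,
                    OutputCollector output, Reporter reporter) {
        for (String word : value.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                output.collect(word, 1);
            }
        }
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        new VerboseMapper().map("line-1", "Hello Hadoop hello",
                (k, v) -> emitted.add(k + "=" + v), new Reporter() {});
        System.out.println(emitted);
        // prints [hello=1, hadoop=1, hello=1]
    }
}
```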
20. People don’t use
Hadoop for BI because
they want to...
21. ...they do it because
they have to...
22. ... and unfortunately it
wasn’t designed
for most BI requirements
23. Why not add to Hadoop
the things it’s missing...
24. ... until it can do
what we need it to?
25. If only we had a
Java, embeddable,
data transformation engine...
26. Pentaho Data Integration
[Diagram: Pentaho Data Integration is used to design, deploy, and orchestrate Hadoop integration, feeding data marts, data warehouses, and analytical applications]
27. [Diagram: applications & systems load data into Hadoop (files / HDFS); Hive optimizes access; data flows into DM & DW RDBMSs; a web tier visualizes it through reporting, dashboards, and analysis]
28. [Diagram: the simplified stack – HDFS and Hive within Hadoop, a data mart RDBMS above, and a web tier with reporting, dashboards, and analysis on top]
29. 30,000ft View
[Diagram: a PDI client on the host machine sends tasks and jobs to the pentaho-hadoop-vm, which runs Hadoop with HDFS and Hive]
30. Inside the VM
[Diagram: within the pentaho-hadoop-vm, Hadoop hosts HDFS and Hive; a job consists of a mapper and a reducer]
31. Inside a job
[Diagram: a job runs a mapper and a reducer*, each implemented as a Java application or via scripting]
* A combiner can be used to pre-reduce in memory on the mappers before data is transmitted.
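The combiner mentioned in the footnote can be illustrated in plain Java. The names here are hypothetical and no Hadoop APIs are used; the point is simply that pre-summing pairs in memory on the mapper shrinks what must cross the network to the reducers.

```java
import java.util.*;

// Sketch of what a combiner buys you: instead of emitting one (word, 1) pair
// per occurrence, each mapper pre-sums counts locally, so far fewer pairs
// travel over the network to the reducers. Plain Java, no Hadoop APIs.
public class CombinerSketch {

    // Without a combiner: one pair per word occurrence.
    static List<Map.Entry<String, Integer>> mapOnly(String block) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : block.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // With a combiner: the mapper's pairs are summed in memory first.
    static Map<String, Integer> mapWithCombiner(String block) {
        Map<String, Integer> partial = new HashMap<>();
        for (Map.Entry<String, Integer> p : mapOnly(block)) {
            partial.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        String block = "to be or not to be";
        System.out.println("pairs without combiner: " + mapOnly(block).size());
        System.out.println("pairs with combiner:    " + mapWithCombiner(block).size());
        // prints 6 and 4 respectively
    }
}
```

This only works because word-count's reduce (integer addition) is associative and commutative; a combiner can safely apply it early to any subset of a key's pairs.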
32. Inside a job with PDI
[Diagram: with PDI, the mapper and the reducer each embed a PDI execution engine running a transformation composed of steps]
33. Demo
34. The Single-Threaded Transformation Engine
• Designed to use a single thread
• Processes rows per batch, because Hadoop delivers rows in batches
• Knows when the batch of rows is processed
• Is only initialized once and disposed of once
• Has reduced overhead for data passing between steps
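As a rough sketch of the idea (illustrative names only, not the actual PDI engine API): one thread pushes a whole batch of rows through every step in turn, so there are no inter-thread queues or context switches between steps.

```java
import java.util.*;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Minimal single-threaded step pipeline. Each "step" transforms an entire
// batch of rows; the one calling thread runs every step in sequence, so no
// hand-off buffers or context switches are needed between steps.
public class SingleThreadedPipeline {

    private final List<UnaryOperator<List<String>>> steps = new ArrayList<>();

    SingleThreadedPipeline addStep(UnaryOperator<List<String>> step) {
        steps.add(step);
        return this;
    }

    // Process one batch: the same thread applies every step in order.
    List<String> processBatch(List<String> rows) {
        List<String> batch = rows;
        for (UnaryOperator<List<String>> step : steps) {
            batch = step.apply(batch); // same thread, no inter-step queues
        }
        return batch;
    }

    public static void main(String[] args) {
        SingleThreadedPipeline p = new SingleThreadedPipeline()
                .addStep(rows -> rows.stream().map(String::trim)
                        .collect(Collectors.toList()))
                .addStep(rows -> rows.stream().filter(r -> !r.isEmpty())
                        .collect(Collectors.toList()));
        System.out.println(p.processBatch(List.of(" a ", "  ", "b")));
        // prints [a, b]
    }
}
```

The trade-off is throughput: a multi-threaded engine overlaps step work at the cost of per-row queueing, which is wasted effort when the caller (here, Hadoop) already delivers discrete batches.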
35. The Single-Threaded Transformation Engine
• Is no longer used inside of Hadoop thanks to new developments – “the multi-threaded engine is still faster,” they said
• Is being introduced into PDI 4.2.0 (CE)
• You will be able to specify that a mapping runs single-threaded
• Allows you to reduce context switching in large to huge transformations (lots of steps)
36. Pentaho for Hadoop Resources
Download: www.pentaho.com/download/hadoop
Pentaho for Hadoop webpage – resources, press, events, partnerships and more: www.pentaho.com/hadoop
Big Data Analytics: 5-part video series with James Dixon, Pentaho CTO
Or contact me: mcasters at pentaho dot org
37. Thank You.
Join the conversation. You can find us on:
http://blog.pentaho.com
@Pentaho
Pentaho Facebook Group
Pentaho - Open Source Business Intelligence Group