Liferay & Big Data Dev Con 2014

Liferay & Big Data
Getting value from your data
Miguel Ángel Pastor Olivar
miguel.pastor@liferay.com

Who am I?
• Some random guy
• Member of the Liferay core infrastructure team
• Disclaimer: Not a computer scientist
• @miguelinlas3

What are we going to talk about?
• Big Data: what is this about?
• Simple architecture proposal
• Use cases
• Questions (and hopefully answers)

• Data is so big that regular solutions are:
– Extremely slow
– Too small
– Really expensive
• How we use all the data we already own

• Volume
– Transactions, data streaming from social media, …
• Velocity
– Torrents of data in real time
• Variety
– Numerical data, text, email, video, audio, …

• Recommender systems
• Predicting the future:
– Netflix does autoscaling based on past network data traffic
• Churn models
– Big telco companies build social networks to reduce churn

• Sentiment analysis
– Are people talking about you on the Internet?
• Real Time Bidding
– Optimise advertising
• Health care
– Improve patients' health while reducing costs
– Improve quality of life of multiple sclerosis patients

• Storage models
– How to store relevant information
• Computation models
– Process and transform all the information
• Analytics
– How we can take actions based on the previous steps

Hadoop Distributed File System (HDFS)
• Java-based file system
• Scalable, fault-tolerant, distributed storage
• Designed to run on commodity hardware
• Closely related to MapReduce

Source: http://hortonworks.com/

• Semistructured data
• Focused on
– Horizontal scalability
– Availability
• Different trade-offs: CAP, BASE, …

• Modern relational databases
• Same scalable performance as NoSQL for OLTP
• Maintain ACID guarantees
• A few alternatives: VoltDB, Google Spanner, FoundationDB, …

Apache Hadoop Map Reduce
• Distributed processing
• Large datasets
• Clusters of computers
• Simple programming model
• Verbose and hard-to-use API

[Figure: MapReduce word count]
Input words: Liferay, projects, is, the, best, Open, Source, project
Map input: (index, “…”) pairs, one per input line
Sort and shuffle: (best, [1]), (is, [1]), (Liferay, [1]), (Open, [1]), (project, [1,1]), (Source, [1]), (the, [1])
Output: best: 1, is: 1, Liferay: 1, Open: 1, project: 2, Source: 1, the: 1
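
The word count flow in the figure can be sketched in a few lines of plain Python. This is only an illustration of the map, sort-and-shuffle and reduce steps, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the two input lines are made up so that the result matches the counts shown on the slide.

```python
from collections import defaultdict


def map_phase(index, line):
    """Map step: emit a (word, 1) pair for every word of one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Sort and shuffle: group the emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Case-insensitive ordering, like the grouped pairs listed in the figure.
    return sorted(groups.items(), key=lambda item: item[0].lower())


def reduce_phase(key, values):
    """Reduce step: sum all the values collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps sequentially on a list of text lines."""
    pairs = []
    for index, line in enumerate(lines):
        pairs.extend(map_phase(index, line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical input, chosen to reproduce the counts in the figure.
    sample = ["Liferay project is", "the best Open Source project"]
    for word, count in word_count(sample).items():
        print(f"{word}: {count}")
```

In a real cluster the map and reduce calls run in parallel on different machines and the framework does the shuffle. Here everything runs in one process, which is enough to show the shape of the data at each step.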

• Batch model data crunching
• Not so good at event stream processing
• But …
• Many algorithms are hard to implement using MapReduce
• Cascading, Scalding, Cascalog, Impala, …

• Distributed realtime computation system
• Easy to reliably process unbounded streams of data
• Multi language support
• Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, …

• Fast and general-purpose cluster computing
– Developed by Berkeley AMPLab
• High level APIs (not MapReduce)
• Optimised engine:
– Supports general execution graphs
• Higher-level tools:
– Spark SQL, MLlib, Spark Streaming, GraphX

• Scalable machine learning library
• Built on top of Hadoop
• Some algorithms don’t require Hadoop at all

• Focused on:
– Data visualisation
– Statistical computations
– Analysis of data
• Tons of built-in packages
• Connect to Hadoop through Hadoop Streaming
• Not a fast language

[Architecture diagram. Labels: RDBMS, Event Broker, Hadoop, User Tracking, NoSQL Storage, System Events, Search, Data, Logs, Monitoring, Data Warehouse, Streaming, Social Graph]

• System events
• User tracking (client side)
– Clicks, navigation, activities, …
• Monitoring (transactions, page load times, …)
• Models (message boards, blogs, wiki, …)
• Custom developments …

[Figure: a log with entries numbered 0 to 9. The Data Source writes to the log; System A and System B read from it.]

Apache Kafka
• Publish-subscribe as distributed commit log
• Fast
• Scalable
• Durable
• Distributed by design
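
The "publish-subscribe as a commit log" idea can be sketched as an append-only list that every consumer reads at its own pace. This is a single-process toy, not the Kafka client API: `CommitLog`, `Consumer`, `append`, `read` and `poll` are hypothetical names, and the sketch has no partitions, replication or persistence.

```python
class CommitLog:
    """Append-only sequence of records, addressed by offset."""

    def __init__(self):
        self._entries = []

    def append(self, record):
        """Write a record at the end of the log and return its offset."""
        self._entries.append(record)
        return len(self._entries) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._entries[offset:offset + max_records]


class Consumer:
    """Reader that keeps its own position, independent of other readers."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        """Fetch the next records and advance this consumer's offset."""
        records = self.log.read(self.offset, max_records)
        self.offset += len(records)
        return records


if __name__ == "__main__":
    log = CommitLog()
    for n in range(10):  # the data source writes entries 0 to 9
        log.append(f"event-{n}")

    system_a = Consumer(log)
    system_b = Consumer(log)
    print(system_a.poll(max_records=10))  # System A reads everything
    print(system_b.poll(max_records=4))   # System B is only part way through
```

Because reading never removes anything from the log, a slow consumer does not block a fast one, and a new consumer can start from offset 0 and replay the whole history.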

[Diagram: Producer, Consumer, Broker A, Broker B, Broker C, ZooKeeper]

Batch processing?
Real time processing?
Machine learning algorithms?
Graph analysis?
Unified programming model?

• Fast and general engine for large-scale data processing
• Write your apps in Java, Scala or Python
• Runs on the YARN cluster manager
• Can read any existing Hadoop data (HDFS)
• In memory or on disk

Apache Spark Main Components
[Diagram: Apache Spark at the base, with Spark SQL, Spark Streaming, MLlib and GraphX on top]

• The driver program runs the main function and executes various parallel operations on a cluster
• Resilient Distributed Datasets (RDD), created from:
– HDFS (or any Hadoop file system)
– A Scala collection
• Second abstraction: shared variables

• Mix SQL queries with Spark programs
• Unified Data Access
• Hive compatibility
• Standard JDBC or ODBC connectivity
• Same engine for both interactive and long running queries

• Build your apps using high-level operators
• Fault tolerance: exactly-once semantics out of the box
• Combine streaming with batch and interactive queries
• Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ
• Define your own custom data sources

• Basic statistics
– Summary statistics
– Correlations
– …
• Classification and regression
– Linear models
– Decision trees
– Naive Bayes

• Clustering
– K-Means
• Collaborative filtering
– Alternating least squares
• Dimensionality reduction
– Singular value decomposition
– Principal component analysis
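
To make one of the listed algorithms concrete, here is a minimal sketch of K-Means (Lloyd's iteration) in plain Python. It is not the MLlib implementation or its API: the function names are hypothetical, the initial centroids are simply the first k points so that the run is deterministic, and the six sample points are invented.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    count = len(points)
    return tuple(sum(coords) / count for coords in zip(*points))


def k_means(points, k, iterations=20):
    """Alternate the assignment and update steps until the centroids stop moving."""
    centroids = list(points[:k])  # assumption: naive deterministic initialisation
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[closest(point, centroids)].append(point)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    # Invented sample: two visually separate groups of 2D points.
    data = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
    print(k_means(data, k=2))
```

A distributed version keeps the same two steps: the assignment step is a map over the points and the update step is a reduce per cluster, which is why the algorithm fits a cluster computing engine well.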

• Graphs API and graph-parallel computation
• Growing scale and importance
– From social networks to language modelling
• Directed multigraph with properties attached to each vertex and edge
• Growing collection of graph algorithms and builders
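
The data model named above, a directed multigraph with properties on every vertex and edge, can be sketched as follows. This is plain Python and not the GraphX API: `PropertyGraph`, `Edge`, `add_vertex`, `add_edge` and `out_degrees` are hypothetical names, and the users and relations are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """Directed edge carrying its own properties."""
    src: int
    dst: int
    props: dict


@dataclass
class PropertyGraph:
    """Directed multigraph: parallel edges between the same vertices are allowed."""
    vertices: dict = field(default_factory=dict)  # vertex id -> properties
    edges: list = field(default_factory=list)     # a list, so duplicates are kept

    def add_vertex(self, vertex_id, **props):
        self.vertices[vertex_id] = props

    def add_edge(self, src, dst, **props):
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be added as vertices first")
        self.edges.append(Edge(src, dst, props))

    def out_degrees(self):
        """Number of outgoing edges per vertex, counting parallel edges."""
        degrees = {vertex_id: 0 for vertex_id in self.vertices}
        for edge in self.edges:
            degrees[edge.src] += 1
        return degrees


if __name__ == "__main__":
    graph = PropertyGraph()
    graph.add_vertex(1, name="alice")
    graph.add_vertex(2, name="bob")
    graph.add_vertex(3, name="carol")
    graph.add_edge(1, 2, relation="follows")
    graph.add_edge(1, 2, relation="mentions")  # second edge between the same pair
    graph.add_edge(2, 3, relation="follows")
    print(graph.out_degrees())
```

Storing edges in a list rather than a set or a dictionary keyed by (src, dst) is what makes this a multigraph: two users can be linked by several relations at once.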

Live demo!
Building a message classifier

• Not about data size, but how you use it
• You already own tons of data, you just need to get value from it
• There is no silver bullet: you’ve plenty of alternatives
• JVM-based Big Data technologies are usually a great choice
• Try it yourself!!

• Apache Kafka
• Apache Spark
• Apache Storm
• Apache Hadoop
• Big Data definition at Wikipedia
• Liferay Kafka Bridge
• What every software engineer should know about a log

Questions
(and hopefully answers)
