Liferay & Big Data
Getting value from your data
Miguel Ángel Pastor Olivar
miguel.pastor@liferay.com
Who am I?
• Some random guy
• Member of the Liferay core infrastructure team
• Disclaimer: not a computer scientist
• @miguelinlas3
What are we going to talk about?
• Big Data: what is this about?
• Simple architecture proposal
• Use cases
• Questions (and hopefully answers)
Big Data?
• Data is so big that regular solutions are:
– Extremely slow
– Too small
– Really expensive
• How we use all the data we already own
• Volume
– Transactions, data streaming from social media, …
• Velocity
– Torrents of data in real time
• Variety
– Numerical data, text, email, video, audio, …
Popular usages
• Recommender systems
• Predicting the future:
– Netflix does autoscaling based on past network data traffic
• Churn models
– Big telco companies build social networks to reduce churn
• Sentiment analysis
– Are people talking about you on the Internet?
• Real Time Bidding
– Optimise advertising
• Health care
– Improve patients’ health while reducing costs
– Improve quality of life of multiple sclerosis patients
Terminology
• Storage models
– How to store relevant information
• Computation models
– Process and transform all the information
• Analytics
– How we can take actions based on the previous steps
Big Data Architectures
Data storage
Hadoop Distributed File System (HDFS)
• Java-based file system
• Scalable, fault-tolerant, distributed storage
• Designed to run on commodity hardware
• Closely related to MapReduce
Source: http://hortonworks.com/
NoSQL storage
• Semi-structured data
• Focused on:
– Horizontal scalability
– Availability
• Different trade-offs: CAP, BASE, …
NewSQL storage
• Modern relational databases
• Same scalable performance as NoSQL for OLTP
• Maintain ACID guarantees
• A few alternatives: VoltDB, Google Spanner, FoundationDB, …
Computation and analytics
Apache Hadoop
Apache Hadoop MapReduce
• Distributed processing
• Large datasets
• Clusters of computers
• Simple programming model
• Verbose, hard-to-use API
Word count example (diagram):
Input: Liferay project is the best Open Source project
Map input: (index, “…”) pairs
Sort and shuffle: (best, [1]) (is, [1]) (Liferay, [1]) (Open, [1]) (project, [1,1]) (Source, [1]) (the, [1])
Output: best: 1, is: 1, Liferay: 1, Open: 1, project: 2, Source: 1, the: 1
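The word count flow above can be sketched in plain Python. This is only an illustration of the map, sort-and-shuffle and reduce stages under simplifying assumptions (a single process, everything in memory); it does not use the Hadoop API, and the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical.

```python
from collections import defaultdict

def map_phase(index, line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Sort and shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(word, counts):
    """Reduce: collapse the list of values for one key into a total."""
    return word, sum(counts)

def word_count(lines):
    """Run the three stages in sequence over (index, line) records."""
    emitted = []
    for index, line in enumerate(lines):
        emitted.extend(map_phase(index, line))
    grouped = shuffle(emitted)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

# Assumed input split: the example sentence cut into two records.
lines = ["Liferay project is the best", "Open Source project"]
result = word_count(lines)
print(result)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data across the network; the shape of the three functions stays the same.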
• Batch-model data crunching
• Not so good at event stream processing
• But …
• Many algorithms are hard to implement using MapReduce
• Cascading, Scalding, Cascalog, Impala, …
Apache Storm
• Distributed realtime computation system
• Easy to reliably process unbounded streams of data
• Multi-language support
• Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, …
[Diagram: Storm topology with two spouts and three bolts]
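A rough way to picture spouts and bolts is as chained Python generators: a spout produces tuples and each bolt consumes a stream and emits a new one. The sketch below is not the Storm API; the names `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical, and the spout is finite only so that the example terminates.

```python
from collections import Counter

def sentence_spout():
    """Spout: the source of the stream (finite here so the sketch ends)."""
    yield from ["big data", "big streams", "data streams"]

def split_bolt(stream):
    """Bolt: consume sentences and emit one tuple per word."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt: keep a running count and emit the updated (word, count) pair."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]

# Wire the topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
updates = list(topology)
print(updates)
```

Each element is processed as soon as it arrives, which is the main contrast with the batch model of MapReduce.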
Apache Spark
• Fast and general-purpose cluster computing
• Developed at the UC Berkeley AMPLab
• High-level APIs (not MapReduce)
• Optimised engine:
– Supports general execution graphs
• Higher-level tools:
– Spark SQL, MLlib, Spark Streaming, GraphX
Apache Mahout
• Scalable machine learning library
• Built on top of Hadoop
• Some algorithms don’t require Hadoop at all
R language
• Focused on:
– Data visualisation
– Statistical computations
– Analysis of data
• Tons of built-in packages
• Connects to Hadoop through Hadoop Streaming
• Not a fast language
Reference Architecture
[Diagram: reference architecture with the following components: RDBMS, Event Broker, Hadoop, User Tracking, NoSQL Storage, System Events, Search, Data, Logs, Monitoring, Data Warehouse, Streaming, Social Graph]
Datasources
• System events
• User tracking (client side)
– Clicks, navigation, activities, …
• Monitoring (transactions, page load times, …)
• Models (message boards, blogs, wiki, …)
• Custom developments …
Event broker
[Diagram: a data source writes to a log with offsets 0 to 9; System A and System B each read from the log]
Apache Kafka
• Publish-subscribe as a distributed commit log
• Fast
• Scalable
• Durable
• Distributed by design
[Diagram: Kafka cluster with a producer, a consumer, Brokers A, B and C, and ZooKeeper]
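The commit-log idea behind the event broker can be sketched as an append-only list in which every record gets a sequential offset and each reader keeps its own position. This is a single-process illustration, not the Kafka client API; `CommitLog` and `Consumer` are hypothetical names, and partitioning, replication and persistence are left out.

```python
class CommitLog:
    """Append-only log; each record gets the next sequential offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own offset, so readers do not affect each other."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
for i in range(10):  # the data source writes offsets 0 to 9
    log.append(f"event-{i}")

system_a = Consumer(log)
system_b = Consumer(log)
first_a = system_a.poll(max_records=4)   # System A reads part of the log
all_b = system_b.poll(max_records=10)    # System B reads everything
rest_a = system_a.poll(max_records=10)   # System A resumes where it stopped
print(first_a, rest_a[0], len(all_b))
```

Because reads never remove records, a slow consumer and a fast consumer can share the same log without coordinating with each other.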
Computation and analytics
Batch processing?
Real time processing?
Machine learning algorithms?
Graph analysis?
Unified programming model?
• Fast and general engine for large-scale data processing
• Write your apps in Java, Scala or Python
• Runs on the YARN cluster manager
• Can read any existing Hadoop data (HDFS)
• In memory or on disk
Apache Spark Main Components
[Diagram: Spark SQL, Spark Streaming, MLlib and GraphX on top of Apache Spark]
Spark Core
• A driver program runs the main function and executes various parallel operations on a cluster
• Resilient Distributed Datasets (RDD), created from:
– HDFS (or any Hadoop file system)
– A Scala collection
• Second abstraction: shared variables
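The idea of a driver that splits a collection into partitions and runs operations on them in parallel can be illustrated with a toy in plain Python. This is deliberately not the Spark API: `parallelize`, `map_partitions` and `reduce_all` are hypothetical helpers, threads stand in for cluster workers, and fault tolerance is ignored.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallelize(collection, num_partitions):
    """Split a local collection into roughly equal partitions."""
    return [collection[i::num_partitions] for i in range(num_partitions)]

def map_partitions(partitions, fn):
    """Apply fn to every element, with one worker per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda part: [fn(x) for x in part], partitions))

def reduce_all(partitions, fn):
    """Reduce each partition locally, then combine the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)

# The "driver": builds the dataset and chains the parallel operations.
numbers = list(range(1, 11))
partitions = parallelize(numbers, num_partitions=3)
squares = map_partitions(partitions, lambda x: x * x)
total = reduce_all(squares, lambda a, b: a + b)
print(total)
```

The reduce function has to be associative and commutative for the per-partition results to combine correctly, which is the same constraint a distributed reduce imposes.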
Spark SQL
• Mix SQL queries with Spark programs
• Unified data access
• Hive compatibility
• Standard JDBC or ODBC connectivity
• Same engine for both interactive and long-running queries
Spark Streaming
• Build your apps using high-level operators
• Fault tolerance: exactly-once semantics out of the box
• Combine streaming with batch and interactive queries
• Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ
• Define your own custom data sources
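Spark Streaming processes a stream as a sequence of small batches, and windowed operators combine the most recent batches. The following plain-Python sketch shows that idea only; it is not the Spark Streaming API, and `micro_batches`, `windowed_counts` and the sample events are hypothetical.

```python
from collections import Counter, deque

def micro_batches(events, batch_size):
    """Cut the incoming events into small fixed-size batches."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]

def windowed_counts(batches, window_length):
    """After each batch, emit counts over the last window_length batches."""
    window = deque(maxlen=window_length)
    for batch in batches:
        window.append(Counter(batch))
        yield sum(window, Counter())

events = ["login", "click", "click", "login", "view", "click"]
results = list(windowed_counts(micro_batches(events, batch_size=2), window_length=2))
for counts in results:
    print(dict(counts))
```

Since each batch is an ordinary collection, the same counting logic could be reused for a one-off batch job, which is the point of combining streaming with batch queries.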
Spark MLlib
• Basic statistics
– Summary statistics
– Correlations
– …
• Classification and regression
– Linear models
– Decision trees
– Naive Bayes
• Clustering
– K-Means
• Collaborative filtering
– Alternating least squares
• Dimensionality reduction
– Singular value decomposition
– Principal component analysis
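To make one of the listed algorithms concrete, here is a minimal multinomial Naive Bayes text classifier in plain Python with add-one smoothing. It is a sketch, not MLlib code and not the code used in the talk; the `train` and `classify` helpers, the labels and the training messages are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (label, text). Collect the counts Naive Bayes needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log posterior for the given text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log prior plus log likelihoods, with add-one (Laplace) smoothing
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("spam", "win a free prize now"),
    ("spam", "free money win big"),
    ("ham", "meeting notes for the portal release"),
    ("ham", "please review the release notes"),
]
model = train(training)
print(classify(model, "free prize money"))
print(classify(model, "review the release notes"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps unseen words from zeroing out a class.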
Spark GraphX
• Graph API and graph-parallel computation
• Growing scale and importance
– From social networks to language modelling
• Directed multigraph with properties attached to each vertex and edge
• Growing collection of graph algorithms and builders
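A directed multigraph with properties on vertices and edges can be modelled with very little code. The sketch below is plain Python rather than the GraphX API; the `Edge` class, the sample vertices and the helpers `out_degrees` and `triplets` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: int
    dst: int
    attr: str  # property attached to the edge

# Vertex id -> property attached to the vertex.
vertices = {1: "alice", 2: "bob", 3: "carol"}

# A multigraph allows parallel edges: 1 -> 2 appears twice with different properties.
edges = [
    Edge(1, 2, "follows"),
    Edge(1, 2, "messaged"),
    Edge(2, 3, "follows"),
    Edge(3, 1, "follows"),
]

def out_degrees(edges):
    """Count the outgoing edges of every source vertex."""
    return Counter(edge.src for edge in edges)

def triplets(vertices, edges):
    """Join each edge with the properties of its two endpoints."""
    return [(vertices[e.src], e.attr, vertices[e.dst]) for e in edges]

degrees = out_degrees(edges)
views = triplets(vertices, edges)
print(degrees)
print(views[0])
```

Keeping vertices and edges as two separate collections is what lets a graph be processed with the same parallel operations as any other dataset.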
Live demo!
Building a message classifier
Takeaways
• Not about data size, but how you use it
• You already own tons of data; you just need to get value from it
• There is no silver bullet: you have plenty of alternatives
• JVM-based Big Data technologies are usually a great choice
• Try it yourself!!
References
• Apache Kafka
• Apache Spark
• Apache Storm
• Apache Hadoop
• Big Data definition at Wikipedia
• Liferay Kafka Bridge
• What every software engineer should know about a log
Thank you!!
Questions (and hopefully answers)

Liferay & Big Data Dev Con 2014

  • 1. Liferay & Big Data Getting value from your data ! Miguel Ángel Pastor Olivar miguel.pastor@liferay.com
  • 2. Who am I? ! • Some random guy ! • Member of the Liferay core infrastructure team ! •Disclaimer: Not a computer scientist ! • @miguelinlas3
  • 3. What are we going to talk about? ! • Big Data: what is this about? ! • Simple architecture proposal ! • Use cases ! • Questions (and hopefully answers)
  • 5. • Data is so big that regular solutions are: ! –Extremely slow ! –Too small ! –Really expensive ! • How we use all the data we already own
  • 6. ! • Volume –Transactions, data streaming from social media, … ! • Velocity –Torrents of data in real time ! • Variety –Numerical data, text, email, video, audio, …
  • 8. • Recommender systems ! • Predicting the future: – Netflix does autoscaling based on past network data traffic ! • Churn models – Big telco companies build social networks to reduce the churn
  • 9. • Sentiment analysis –Are talking about you in the Internet? ! • Real Time Bidding –Optimise advertising ! • Health care –Improve patients health while reducing costs –Improve quality of life of multiple sclerosis patients
  • 11. • Storage models • How to store relevant information ! • Computation models • Process and transform all the information ! • Analytics • How we can take actions based on the previous steps
  • 14. Hadoop Distributed File System (HDFS) ! • Java based file system ! • Scalable, fault-tolerant, distributed storage ! • Designed to run on commodity hardware ! • Closely related to MapReduce
  • 17. • Semistructured data ! • Focused on ! • Horizontal scalability ! • Availability ! • Different trade-offs: CAP, BASE, … !
  • 19. • Modern relational databases ! • Same scalable performance than NoSQL for OLTP ! • Maintain ACID guarantees ! • A few alternatives: VoltDB, Google Spanner, FoundationDB, …
  • 22. Apache Hadoop Map Reduce ! • Distributed processing ! • Large datasets ! •Clusters of computers #LRNAS2014 ! • Simple programming model ! • Verbose and hard to use API
  • 23. Liferay projects is the best Open Source project best: 1 is: 1 Liferay: 1 Open: 1 project: 2 Source: 1 the: 1 (index, “…”) (index, “…”) (index, “…”) (index, “…”) (index, “…”) Sort and shuffle (best, [1]) (is, [1]) (Liferay: 1) (Open, [1]) (project, [1,1]) (Source, [1]) (the, [1])
  • 24. • Batch model data crunching • Not so good at event stream processing • But … • Many algorithms are hard to implement using MapReduce • Cascading, Scalding, Cascalog, Impala, …
  • 26. • Distributed realtime computation system • Easy to reliably process unbounded streams of data • Multi-language support • Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, …
  • 27. [Diagram: Storm topology made of two spouts and three bolts]
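A spout is a source of tuples and a bolt is a processing step that consumes one stream and emits another. The idea can be sketched with plain Python generators. This is not the Storm API: every name below is hypothetical, and the wiring between the two spouts and the three bolts is an assumption, since the slide only shows the components.

```python
from collections import Counter
from itertools import chain


def sentence_spout(sentences):
    """Spout: emits raw sentences into the topology."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: splits each sentence into lower-cased words."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def filter_bolt(stream, stop_words):
    """Bolt: drops words that are not worth counting."""
    for word in stream:
        if word not in stop_words:
            yield word


def count_bolt(stream):
    """Bolt: emits the running count of each word as it arrives."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# Two spouts feed the same chain of three bolts (assumed wiring).
spout_a = sentence_spout(["Big Data is big"])
spout_b = sentence_spout(["Data is everywhere"])

merged = chain(spout_a, spout_b)
running = count_bolt(filter_bolt(split_bolt(merged), stop_words={"is"}))

# Keep the latest count seen for every word.
latest = dict(running)
```

A real stream is unbounded, so a topology never "finishes" the way this finite example does; the running counts are the output.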
  • 29. • Fast and general-purpose cluster computing • Developed at Berkeley's AMPLab • High-level APIs (not MapReduce) • Optimised engine: • supports general execution graphs • Higher-level tools: • Spark SQL, MLlib, Spark Streaming, GraphX
  • 31. • Scalable machine learning library • Built on top of Hadoop • Some algorithms don’t require Hadoop at all
  • 33. • Focused on: • Data visualisation • Statistical computations • Analysis of data • Tons of built-in packages • Connects to Hadoop through Hadoop Streaming • Not a fast language
  • 35. [Architecture diagram] RDBMS Event Broker Hadoop User Tracking NoSQL Storage System Events Search Data Logs Monitoring Data Warehouse Streaming Social Graph
  • 38. • System events • User tracking (client side) • Clicks, navigation, activities, … • Monitoring (transactions, page load times, …) • Models (message boards, blogs, wiki, …) • Custom developments …
  • 41. [Diagram: a data source appends writes to a log of numbered entries (0 to 9); System A and System B each read from the log]
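The log on this slide is an append-only sequence of numbered entries that several systems read at their own pace. A minimal in-memory sketch follows; the class and method names are hypothetical and this is not the Kafka client API.

```python
class CommitLog:
    """Append-only log: entries are never modified, only added at the end."""

    def __init__(self):
        self._entries = []

    def append(self, record):
        """Store a record and return the offset it was written at."""
        self._entries.append(record)
        return len(self._entries) - 1

    def read(self, offset, max_records):
        """Return up to max_records entries starting at the given offset."""
        return self._entries[offset:offset + max_records]


class Reader:
    """A consumer that remembers its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
for n in range(10):  # the data source writes entries 0..9
    log.append(f"event-{n}")

system_a = Reader(log)
system_b = Reader(log)

batch_a = system_a.poll(max_records=4)   # System A is a slow reader
batch_b = system_b.poll(max_records=10)  # System B has caught up
```

Because each reader only tracks an offset, a slow consumer never blocks a fast one, and either of them can re-read old entries by moving its offset back.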
  • 42. Apache Kafka • Publish-subscribe as a distributed commit log • Fast • Scalable • Durable • Distributed by design
  • 43. [Diagram: Kafka cluster with Broker A, Broker B and Broker C, ZooKeeper, a Producer and a Consumer]
  • 46. Batch processing? Real time processing? Machine learning algorithms? Graph analysis? Unified programming model?
  • 47.
  • 48. • Fast and general engine for large-scale data processing • Write your apps in Java, Scala or Python • Runs on the YARN cluster manager • Can read any existing Hadoop data (HDFS) • In memory or on disk
  • 49. Apache Spark Main Components: Apache Spark, Spark SQL, Spark Streaming, MLlib, GraphX
  • 51. • Driver program: runs the main function and executes various parallel operations on a cluster • Resilient Distributed Datasets (RDD), created from: • HDFS (or any Hadoop file system) • A Scala collection • Second abstraction: shared variables
  • 53. • Mix SQL queries with Spark programs • Unified data access • Hive compatibility • Standard JDBC or ODBC connectivity • Same engine for both interactive and long-running queries
  • 55. • Build your apps using high-level operators • Fault tolerance: exactly-once semantics out of the box • Combine streaming with batch and interactive queries • Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ • Define your own custom data sources
  • 57. • Basic statistics • Summary statistics • Correlations • … • Classification and regression • Linear models • Decision trees • Naive Bayes
  • 58. • Clustering • K-Means • Collaborative filtering • Alternating least squares • Dimensionality reduction • Singular value decomposition • Principal component analysis
  • 60. • Graph API and graph-parallel computation • Growing scale and importance • From social networks to language modelling • Directed multigraph with properties attached to each vertex and edge • Growing collection of graph algorithms and builders
  • 61. Live demo! Building a message classifier
  • 63. • Not about data size, but how you use it • You already own tons of data, you just need to get value from it • There is no silver bullet: you have plenty of alternatives • JVM-based Big Data technologies are usually a great choice • Try it yourself!!
  • 65. • Apache Kafka • Apache Spark • Apache Storm • Apache Hadoop • Big Data definition at Wikipedia • Liferay Kafka Bridge • What every software engineer should know about a log