BDS14 Big Data Analytics to the masses

Slides from my talk at Big Data Spain 2014 in Madrid.

In this talk, we will discuss our approach to bringing large-scale deep analytics to the masses. R is an extremely popular numerical computing environment, but scientific data processing frequently hits its memory limits. On the other hand, systems for executing data-intensive tasks, like Hadoop or Stratosphere, are not popular among R users because writing programs using these paradigms is cumbersome. We present an innovative approach to overcome these limitations using the Stratosphere/Apache Flink big data platform by means of an R package and ready-to-use distributed algorithms.
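
To make the memory-limit problem concrete, here is a minimal sketch in Python (not code from the talk or from the R package it presents) of the general idea behind distributed algorithms: each partition is reduced to a small summary, and only the summaries are merged. The function names `summarize`, `merge`, `chunked`, and `mean_and_variance` are hypothetical, and the chunks are processed in a local loop where a real platform would run them on cluster workers.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

# A partition summary: (count, mean, sum of squared deviations from the mean).
Summary = Tuple[int, float, float]


def summarize(chunk: Iterable[float]) -> Summary:
    """Reduce one partition to a small summary (Welford's online update)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in chunk:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return count, mean, m2


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two partition summaries without revisiting the raw data."""
    na, mean_a, m2_a = a
    nb, mean_b, m2_b = b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2_a + m2_b + delta * delta * na * nb / n
    return n, mean, m2


def chunked(values: Sequence[float], size: int) -> Iterator[List[float]]:
    """Split a sequence into fixed-size chunks (stand-in for data partitions)."""
    for start in range(0, len(values), size):
        yield list(values[start:start + size])


def mean_and_variance(chunks: Iterable[Iterable[float]]) -> Tuple[float, float]:
    """Mean and sample variance computed one chunk at a time."""
    total: Summary = (0, 0.0, 0.0)
    for chunk in chunks:
        # On a cluster, summarize() would run where the partition lives.
        total = merge(total, summarize(chunk))
    count, mean, m2 = total
    variance = m2 / (count - 1) if count > 1 else 0.0
    return mean, variance
```

Because only one chunk and one three-number summary are held in memory at a time, the same computation works whether the data fits in memory or not.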

This solution allows the user, with small modifications to the R code, to easily execute distributed scenarios using popular machine learning techniques. We will cover the implementation details of the proposed solution, including the architecture of the system, the functionality implemented, and working examples.
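
As an illustration of what "small modifications" could look like from the user's side, the following Python sketch keeps one function signature and switches between single-partition and partitioned execution with one extra argument. It is an assumption-laden analogy, not the API of the R package described here: `fit_line` and its `partitions` parameter are hypothetical, and the per-partition work runs in a local loop rather than on Stratosphere/Flink workers.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
# Sufficient statistics for a least-squares line: n, sum x, sum y, sum x*x, sum x*y.
Stats = Tuple[float, float, float, float, float]


def _stats(points: Sequence[Point]) -> Stats:
    """Per-partition work: reduce the points to five numbers."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in points:
        n += 1.0
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return n, sx, sy, sxx, sxy


def _solve(stats: Stats) -> Tuple[float, float]:
    """Client-side work: turn the combined statistics into (slope, intercept)."""
    n, sx, sy, sxx, sxy = stats
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


def fit_line(points: Sequence[Point], partitions: int = 1) -> Tuple[float, float]:
    """Least-squares line; partitions > 1 simulates a distributed run."""
    if partitions < 1:
        raise ValueError("partitions must be >= 1")
    # Hypothetical partitioning: a real platform would decide data placement.
    chunks: List[Sequence[Point]] = [points[i::partitions] for i in range(partitions)]
    partials = [_stats(chunk) for chunk in chunks]  # would run on workers
    n, sx, sy, sxx, sxy = (sum(column) for column in zip(*partials))
    return _solve((n, sx, sy, sxx, sxy))
```

A call such as `fit_line(points)` becomes `fit_line(points, partitions=4)`; the heavy pass over the data is split across partitions, while the final, cheap solve stays on the client.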

In addition, we will cover the differences between our approach and other solutions that integrate R with Hadoop or other large-scale analytics systems. Finally, the results of the performance tests show that this solution is competitive with existing R implementations for small amounts of data and is able to scale up to the gigabyte level.

Published in: Technology

  1. Big Data Analytics to the masses: Why it has failed and how we can fix it. Jose Luis Lopez Pino (@jllopezpino)
  2. Who am I? BI Consultant. Large-Scale & Distributed. Founding Data Engineer.
  3. Big Data is like Tourism: it seems easy to do, but if you aren’t an expert, you can’t make the most of it.
  4. Struggle to analyze Big Data. Harlan Harris, Sean Murphy, and Marck Vaisman. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O’Reilly Media, Inc., 2013. Also: Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. Enterprise data analysis and visualization: An interview study. Visualization and Computer Graphics, IEEE Transactions
  5. Tools. Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. Proceedings of the VLDB Endowment, 7(13), 2014.
  6. Tools (Now). Original: Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. Proceedings of the VLDB Endowment, 7(13), 2014.
  7. Deep analytics
  8. We need libraries... Libraries! Query languages. Write your own MR/RDD/Transformations.
  9. … comprehensive ones!
  10. Say it with memes! When you do deep analytics in small data using R and CRAN packages. When you do deep analytics in BIG data using R and CRAN packages.
  11. When you try to program it using MapReduce. When you try to program it using Apache Spark / Apache Flink. When you try to use a library scalable to large data sets.
  12. Can’t we do it better? - Make it similar to normal R programs. - Hide complexity. - Make file manipulation easier. - Part of the computing in the cluster and part of the computing in the client.
  13. Our approach
  14. Our approach
  15. Behind the scenes: Before
  16. Behind the scenes: After
  17. Without writing significantly different code
  18. Competitive or even faster than R native code in small data
  19. Competitive even in highly iterative programs in small data
  20. And it scales
  21. Some relevant findings - Transmission time was not significant. - Stratosphere/Flink was competitive even in small datasets. - Changes in the code were required. - Ensemble scenarios are the most exciting ones.
  22. 4 Takeaways from this talk - We still need to bring Big Data to the right people in the right place. - We need comprehensive libraries. - We need to move data back and forth. - Use a syntax that the users are familiar with.
  23. That’s all! - Have you found this talk interesting? - Follow me: @jllopezpino - Looking for a job? (SEM Data Analyst, Senior Analyst) - GYG is hiring: - Are you interested in Data + Energy? - Keep in touch: