Big data

Big data Presentation Transcript

  • 1. Introduction to Big Data Survival Guide! Luan Cestari, February 28, 2014. RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD
  • 2. Please, let me ask ... ● Who has already tested a product or project related to Big Data? ● Who works with Big Data?
  • 3. What are we going to see here ● Demystifying the term "Big Data" and beyond! ● What people claim Big Data to be ● The relationship between Big Data and databases ● Some facts about database history ● Why are there so many databases available? ● How to glue all this stuff together ● Some well-known Hadoop ecosystem tools that cover a very wide range of Big Data issues
  • 4. Why Big Data is important ● Many companies are already dealing with Big Data using Open Source tools ● There is demand for people to work with those tools as developers and analysts ● You can also work on integrating those systems, on improving an existing tool, or on building the next Big Data tool
  • 5. Why Big Data is important ● When a company uses Big Data tools, its environment can grow very fast and become complex: ● Many different clusters (due to tenants, geographic location, or different versions) ● Different technologies for closely related purposes (also due to different team skills or use cases) ● A great deal of software integration, layers to separate the different concerns, and refactoring due to the fast pace
  • 6. Cool ... but what is Big Data after all? ● Just tons of information isn't enough; it also needs to have: ● Variety ● Velocity ● Value ● And Volume
  • 7. More about Volume: how big can it be? ● What is the size of a daily batch job at Facebook? 100 GB? 10,000 GB? 100,000 GB? ● Answer: 104,857,600 gigabytes of user logs
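As a quick sanity check on the figure quoted on this slide, the sketch below converts 104,857,600 GB into petabytes. It assumes binary multiples (1 PB = 1024 × 1024 GB), which the slide does not state; the function and constant names are illustrative only.

```python
# Assumption: binary multiples (1 PB = 1024 * 1024 GB); the slide does not say
# which convention it uses.
GB_PER_PB = 1024 * 1024


def gigabytes_to_petabytes(gigabytes: int) -> float:
    """Convert a size in gigabytes to petabytes using binary multiples."""
    return gigabytes / GB_PER_PB


daily_log_gb = 104_857_600  # figure quoted on the slide
print(gigabytes_to_petabytes(daily_log_gb))  # 100.0
```

Under that assumption, the daily batch works out to exactly 100 PB, three orders of magnitude beyond the largest option offered in the question.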
  • 8. More about Variety: where does the data come from? ● Customer-generated content ● M2M ● Sensors ● B2B ● B2C ● Social networks ● And other devices: mobile phones, set-top boxes, security cameras
  • 9. More about Value ● The value lies in processing the data in a reasonable period of time, so that you can forecast something. For that you will need data scientists who can do: ● Analysis (finding correlations using statistics, signal processing, machine learning, personas, etc.) using different kinds of tools (SQL, search engines, stream processing)
  • 10. More about Value ● The value lies in processing the data in a reasonable period of time, so that you can forecast something. For that you will need data scientists who can: ● Find correlations using statistical or predictive analytics, signal processing, machine learning, natural language processing, BI, visualization, etc., using different kinds of tools (SQL, search engines, stream processing)
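To make "finding correlations" concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The two series are made-up numbers, not data from the talk, and in practice a data scientist would more likely reach for a statistics library than hand-roll this.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical data: page views and purchases per day.
page_views = [100, 200, 300, 400, 500]
purchases = [11, 19, 32, 41, 48]
print(round(pearson(page_views, purchases), 3))
```

A value close to 1 suggests the two series move together; it says nothing about which one causes the other.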
  • 11. More about Value ● So the value is in the insights generated, which may help you build a better product, make better decisions, or gain a competitive advantage over competitors ● Open Source also helps by delivering this value cost-effectively, instead of buying tons of expensive tools
  • 12. ... and the Velocity ● This is a very interesting point, because different analyses may have different time requirements: ● A traffic system may need a streaming system to analyze and predict current traffic and suggest better routes through the city ● The same traffic system may need to process several weeks of data to get a good prediction of average traffic on a road, so that could be an offline batch job
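The contrast between the two traffic workloads can be sketched in a few lines. This is only an illustration under assumed names and made-up speed readings: a sliding-window average stands in for the streaming analysis, and a single pass over the full history stands in for the offline batch job.

```python
from collections import deque


class SlidingAverage:
    """Streaming view: average over the most recent `window` readings only."""

    def __init__(self, window: int):
        self._readings = deque(maxlen=window)  # old readings fall off automatically

    def update(self, speed_kmh: float) -> float:
        self._readings.append(speed_kmh)
        return sum(self._readings) / len(self._readings)


def batch_average(readings):
    """Batch view: one pass over the complete history."""
    return sum(readings) / len(readings)


history = [60.0, 55.0, 30.0, 20.0]  # hypothetical speed readings in km/h

live = SlidingAverage(window=2)
for reading in history:
    current = live.update(reading)

print(current)                 # 25.0  (reflects the jam that is forming right now)
print(batch_average(history))  # 41.25 (long-run average for the road)
```

The streaming number reacts quickly but forgets the past; the batch number is stable but arrives late. Which one is "right" depends on the question being asked.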
  • 13. ... and the Velocity ● The main point is that there is no silver bullet: different storage systems may be required depending on the services you aim to provide
  • 14. SQL History ● Hierarchical databases in the 1960s ● Then relational databases in the 1980s, which until a couple of years ago were the only solution used in most enterprises ● Big companies used to buy expensive, specialized DW (data warehouse) database systems to analyze their data
  • 15. ... and now
  • 16. ... and now
  • 17. Again the reason for that ● For example, web analytics at Facebook: ● +1 billion users ● +240 billion photos ● +1 trillion connections ● 22% of references on the Internet ● Harvard Business Review: ● A change from a DW to a Big Data system made a 96-hour job run in just 4 hours ● 2012: 2.5 exabytes created a day
  • 18. We need to avoid the Golden Hammer/Silver Bullet anti-pattern
  • 19. Hadoop ecosystem saves the day ● Open Source projects that help you deal with Big Data ● No need for vertical scaling (big machines): you can use a cluster of commodity machines and achieve even better results ● Parallel processing ● Fault-tolerant jobs ● Redundant and distributed data (to survive disk failures and to avoid moving data around) ● Less complex programming model ● It has low-level native libraries for high performance
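The "less complex programming model" refers to MapReduce. The sketch below mimics its three phases (map, shuffle, reduce) in a single Python process, using the classic word-count example. It is not Hadoop's API; the function names are illustrative, and a real job would run the map and reduce phases in parallel across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "Big tools"]))  # {'big': 2, 'data': 1, 'tools': 1}
```

Because each `map_phase` call sees only one line and each `reduce_phase` call sees only one key, the framework is free to spread them over many machines and to rerun any piece that fails.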
  • 20. Hadoop ecosystem saves the day ● But the Hadoop file system (HDFS) doesn't handle low-latency requests and small files well =( ● Well, there is no silver bullet, we need more tools ● This is why Hadoop is not alone: many different projects integrate with it
  • 21. Hadoop ecosystem saves the day ● But the Hadoop file system (HDFS) doesn't handle low-latency requests and small files well =( ● Well, there is no silver bullet, we need more tools ● This is why Hadoop is not alone: many different projects integrate with it ● Several big companies offer Hadoop and other projects as one big product, and they help the community. I will talk a little more about the Hortonworks and Cloudera project sets, as they are very well known, and how they integrate. Find more at http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
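One way to see why small files hurt HDFS is a back-of-the-envelope estimate of NameNode memory, which has to track every file and every block. The sketch below is not from the talk and rests on two loudly labeled assumptions: roughly 150 bytes of NameNode heap per file or block object (a commonly cited rule of thumb, not a guarantee) and a 128 MB block size.

```python
BYTES_PER_OBJECT = 150        # assumption: rule-of-thumb heap cost per file/block
BLOCK_SIZE = 128 * 1024 ** 2  # assumption: 128 MB HDFS block size


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def namenode_objects(total_bytes: int, file_size: int) -> int:
    """Files plus blocks needed to store `total_bytes` in files of `file_size`."""
    files = ceil_div(total_bytes, file_size)
    blocks_per_file = ceil_div(file_size, BLOCK_SIZE)
    return files + files * blocks_per_file


def namenode_memory_bytes(total_bytes: int, file_size: int) -> int:
    return namenode_objects(total_bytes, file_size) * BYTES_PER_OBJECT


one_tib = 1024 ** 4
small = namenode_memory_bytes(one_tib, 1024 ** 2)  # 1 TiB stored as 1 MiB files
large = namenode_memory_bytes(one_tib, 1024 ** 3)  # 1 TiB stored as 1 GiB files
print(small, large)
```

Under these assumptions the same terabyte costs the NameNode a couple of hundred times more metadata when it is stored as 1 MiB files rather than 1 GiB files, which is the kind of pressure that pushes people toward additional tools.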
  • 22. Hadoop ecosystem saves the day ● Cloudera: CDH
  • 23. Hadoop ecosystem saves the day ● Cloudera: ● How to create this whole stack with minimum effort: Cloudera Manager
  • 24. Hadoop ecosystem saves the day ● Hortonworks: HDP
  • 25. Hadoop ecosystem saves the day ● Hortonworks: ● They use Ambari to manage the cluster, as Cloudera Manager does ● They also have Tez to speed up workloads
  • 26. Hadoop ecosystem saves the day ● And more tools: ● You may use Apache Mesos or Hadoop 2 YARN to better manage and share your services (for example across tenants or clouds) ● Apache Bigtop, Fuse-DFS, Apache Crunch, Apache Whirr, Apache Hama, Apache Giraph, Open MPI, Cascading (and its extensions), Weave, and more
  • 27. Hadoop ecosystem saves the day ● There are more tools for specific cases, such as the Spark ecosystem for low latency
  • 28. Hadoop ecosystem saves the day ● But you can also use other tools for low latency, such as Twitter Storm, Yahoo S4, LinkedIn Samza (or Kafka), Amazon Kinesis, Google MillWheel
  • 29. Integration with other systems will be complex ● An overview:
  • 30. A different approach: Lambda Architecture ● An idea from the Twitter team (including Nathan Marz) about how to deal with Big Data systems
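The slide names the Lambda Architecture without showing its parts. As a rough sketch of the general idea: a batch layer periodically recomputes a view from the full dataset, a speed layer keeps a view of events that arrived since the last batch run, and queries merge the two. The page-view data and function names below are hypothetical, and real systems would back each layer with separate tools.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute page-view counts from the full dataset."""
    return Counter(event["page"] for event in master_dataset)


def realtime_view(recent_events):
    """Speed layer: counts for events not yet covered by the last batch run."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, realtime):
    """Serving layer: answer by merging the batch and real-time views."""
    return batch[page] + realtime[page]  # Counter returns 0 for unseen pages


# Hypothetical events.
master = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
recent = [{"page": "/home"}]

batch = batch_view(master)
realtime = realtime_view(recent)
print(query("/home", batch, realtime))     # 3
print(query("/missing", batch, realtime))  # 0
```

When the next batch run absorbs the recent events, the real-time view for that period can be discarded, so errors in the speed layer are eventually corrected by the batch layer.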
  • 31. Questions?
  • 33. Please, let me ask ... ● Notes: Scalable, portable, on-demand, resource management, measurable
  • 34. What are we going to see here ● Notes: The difference in http://www.slideshare.net/CAinc/cloud-expo-session-fromvirtualization-to-cloud-computing-building-an-effective-pragmatic-reliable-cloud
  • 53. Hadoop ecosystem saves the day ● Cloudera: CDH ● Notes: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • 55. Hadoop ecosystem saves the day ● Hortonworks: HDP ● Notes: Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
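To illustrate what "a DAG of actions" buys a scheduler, the sketch below orders the actions of a made-up workflow so that each one runs only after its dependencies, and rejects graphs that contain a cycle. It uses Kahn's algorithm in plain Python; it is not Oozie's workflow format or API, and it assumes every dependency also appears as a key in the workflow.

```python
from collections import deque


def execution_order(workflow):
    """Order actions so each runs after its dependencies (Kahn's algorithm).

    `workflow` maps an action name to the set of actions it depends on.
    Raises ValueError if the graph has a cycle, i.e. it is not a DAG.
    """
    remaining = {action: set(deps) for action, deps in workflow.items()}
    ready = deque(sorted(a for a, deps in remaining.items() if not deps))
    order = []
    while ready:
        action = ready.popleft()
        order.append(action)
        del remaining[action]
        # Release every action that was waiting only on the one just finished.
        for other in sorted(remaining):
            deps = remaining[other]
            if action in deps:
                deps.remove(action)
                if not deps:
                    ready.append(other)
    if remaining:
        raise ValueError("workflow contains a cycle")
    return order


# Hypothetical workflow: two imports feed a join, which feeds a report.
workflow = {
    "import_logs": set(),
    "import_users": set(),
    "join": {"import_logs", "import_users"},
    "report": {"join"},
}
print(execution_order(workflow))
# ['import_logs', 'import_users', 'join', 'report']
```

The acyclic requirement is what guarantees such an order exists; the two imports have no dependency on each other, so a scheduler could also run them in parallel.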
  • 57. Hadoop ecosystem saves the day ● And more tools ● Notes: Apache Whirr is a set of libraries for running cloud services. The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. Open MPI is an implementation of MPI, a standardized API typically used for parallel and/or distributed computing.
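The idea of a pipeline "composed of many user-defined functions" can be sketched without any framework. The example below is plain Python, not the Crunch API: each stage is an ordinary function that can be tested on its own, and a small hypothetical `pipeline` helper chains them. The input format ("user,bytes") and the threshold are made up for illustration.

```python
from functools import reduce


def pipeline(*stages):
    """Compose stages left to right into a single callable."""
    def run(records):
        return reduce(lambda data, stage: stage(data), stages, records)
    return run


def parse(lines):
    """Turn 'user,bytes' lines into (user, bytes) tuples."""
    return [(user, int(size)) for user, size in (line.split(",") for line in lines)]


def keep_large(records, threshold=100):
    """Drop transfers smaller than the threshold."""
    return [(user, size) for user, size in records if size >= threshold]


def total_per_user(records):
    """Sum transfer sizes per user."""
    totals = {}
    for user, size in records:
        totals[user] = totals.get(user, 0) + size
    return totals


job = pipeline(parse, keep_large, total_per_user)
print(job(["ana,120", "bob,50", "ana,300"]))  # {'ana': 420}
```

A framework in this style adds what the sketch leaves out: planning the stages into MapReduce jobs and running them efficiently on a cluster.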