
Big data analysis in java world

Overview of Big Data approaches (MapReduce, MPP, In-Memory), their Java integration, and a sample project from real life.

Published in: Data & Analytics

  1. Big Data Analysis in Java World by Serhiy Masyutin
  2. Agenda: The Big Data Problem; Map-Reduce; MPP-based Analytical Database; In-Memory Data Grid; Real-Life Project; Q&A
  3. The Big Data Problem http://www.datameer.com/images/product/big_data_hadoop/img_bigdata.png
  4. The Big Data Problem (source: http://blog.pivotal.io/pivotal/products/exploring-big-data-solutions-when-to-use-hadoop-vs-in-memory-vs-mpp)
       When do I need it? Map-Reduce: in an hour. MPP AD: in a minute. IMDG: now.
       What do I need to do with it? Map-Reduce: exploratory analytics. MPP AD: structured analytics. IMDG: singular event processing (some analytics), transactions.
       How will I query and search? Map-Reduce: unstructured. MPP AD: ad hoc SQL. IMDG: structured.
       How do I need to store it? Map-Reduce: I do, but not required to. MPP AD: I must and I am required to. IMDG: temporarily.
       Where is it coming from? Map-Reduce: File/ETL. MPP AD: File/ETL. IMDG: Event/Stream/File/ETL.
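The "When do I need it?" row of the comparison on slide 4 can be read as a simple decision rule. The sketch below is only an illustration of that reading: the enum name, the method name, and the one-second and one-minute cut-offs are assumptions, since the slide itself only says "in an hour", "in a minute", and "now".

```java
import java.time.Duration;

/** The three approaches compared on the slide. */
enum BigDataApproach {
    MAP_REDUCE, MPP_ANALYTICAL_DATABASE, IN_MEMORY_DATA_GRID;

    // Hypothetical cut-offs; the slide gives no exact numbers.
    private static final Duration ABOUT_NOW = Duration.ofSeconds(1);
    private static final Duration ABOUT_A_MINUTE = Duration.ofMinutes(1);

    /** Picks an approach from the longest acceptable wait for an answer. */
    static BigDataApproach forLatency(Duration acceptableWait) {
        if (acceptableWait.compareTo(ABOUT_NOW) <= 0) {
            return IN_MEMORY_DATA_GRID;        // "Now"
        }
        if (acceptableWait.compareTo(ABOUT_A_MINUTE) <= 0) {
            return MPP_ANALYTICAL_DATABASE;    // "In a minute"
        }
        return MAP_REDUCE;                     // "In an hour"
    }
}
```

In practice the other rows (query style, storage obligations, data source) would weigh in as well; latency alone is used here to keep the sketch short.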
  5. The Big Data Problem. Approaches: Map-Reduce, MPP AD, IMDG. Data types: Transactions, Customer records, Geo-spatial, Sensors, Social Media, XML, JSON, Raw Logs, Text, Image, Video. Axis label: more processing. http://blog.pivotal.io/big-data-pivotal/products/exploratory-data-science-when-to-use-an-mpp-database-sql-on-hadoop-or-map-reduce
  6. The Big Data Problem: "Data is not Information" - Clifford Stoll
  7. Map-Reduce http://jeremykun.files.wordpress.com/2014/10/mapreduceimage.gif?w=1800
  8. Map-Reduce https://anonymousbi.files.wordpress.com/2012/11/hadoopdiagram.png
  9. Map-Reduce http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif
  10. Map-Reduce https://anonymousbi.files.wordpress.com/2012/11/hadoopdiagram.png
  11. Map-Reduce. Volume: Medium-Large. Variety: Unstructured data. Velocity: Batch processing.
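The Map-Reduce slides show only diagrams, so a small word count may help make the map, shuffle, and reduce phases concrete. This is a single-JVM sketch using plain Java streams, not Hadoop code; the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class WordCountSketch {

    /** Map phase: turn one input line into a list of lower-cased words. */
    static List<String> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    /** Shuffle and reduce: group identical words, then count each group. */
    static Map<String, Long> mapReduce(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.groupingBy(
                        word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Data is not information", "Big data is big");
        // Prints {big=2, data=2, information=1, is=2, not=1}
        System.out.println(mapReduce(lines));
    }
}
```

A real Map-Reduce framework distributes the same two steps across machines and adds storage and fault tolerance; the batch nature of the model is what places it in the "in an hour" column.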
  12. MPP Analytical Database http://www.ndm.net/datawarehouse/images/stories/greenplum/gp-dia-3-0.png
  13. MPP Analytical Database http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagram.png
  14. MPP Analytical Database http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagramOneNodeDown.png
  15. MPP Analytical Database http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagramTwoNodesDown.png
  16. MPP Analytical Database http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/DataK-Safety-K2Nodes2And3Failed.png
  17. MPP Analytical Database: JDBC http://www.ndm.net/datawarehouse/images/stories/greenplum/gp-dia-3-0.png
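Slide 17 names JDBC as the Java integration point for an MPP analytical database. A minimal sketch of what such a query might look like follows. It uses only the standard `java.sql` API; the table `sensor_readings` and its columns are hypothetical names chosen for illustration, and the caller is assumed to supply a `Connection` obtained from the vendor's JDBC driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.LinkedHashMap;
import java.util.Map;

final class MppJdbcSketch {

    // Hypothetical table and column names, for illustration only.
    static final String AVG_BY_DEVICE_SQL =
            "SELECT device_id, AVG(reading) AS avg_reading "
          + "FROM sensor_readings "
          + "WHERE recorded_at >= ? AND recorded_at < ? "
          + "GROUP BY device_id ORDER BY device_id";

    /** Runs the aggregate on the database and returns device id to average reading. */
    static Map<String, Double> averageByDevice(Connection connection,
                                               Timestamp from,
                                               Timestamp to) throws SQLException {
        Map<String, Double> averages = new LinkedHashMap<>();
        try (PreparedStatement statement = connection.prepareStatement(AVG_BY_DEVICE_SQL)) {
            statement.setTimestamp(1, from);
            statement.setTimestamp(2, to);
            try (ResultSet rows = statement.executeQuery()) {
                while (rows.next()) {
                    averages.put(rows.getString("device_id"), rows.getDouble("avg_reading"));
                }
            }
        }
        return averages;
    }
}
```

The aggregation runs inside the database, so only one row per device travels back to the Java client.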
  18. MPP Analytical Database. Volume: Small-Medium-Large. Variety: Structured data. Velocity: Interactive. Product logos: Aster Database, Matrix.
  19. In-Memory Data Grid https://ignite.incubator.apache.org/images/in_memory_data.png
  20. In-Memory Data Grid https://ignite.incubator.apache.org/images/in_memory_data.png
  21. In-Memory Data Grid https://ignite.incubator.apache.org/images/in_memory_compute.png
  22. In-Memory Data Grid http://hazelcast.com/wp-content/uploads/2013/12/IMDGEmbeddedMode_w1000px.png
  23. In-Memory Data Grid. Volume: Small-Medium. Variety: Structured data. Velocity: (Near) Real-Time.
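The In-Memory Data Grid slides are image-only, so a toy model of the core idea may help: keys are spread over partitions held in memory. The sketch below keeps all partitions in one JVM and picks a partition by hashing the key. It is not the Ignite or Hazelcast API; the class `TinyDataGrid` and its methods are hypothetical, and real grids add networking, replication, and rebalancing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class TinyDataGrid<K, V> {

    // Each map stands in for the memory of one grid node.
    private final List<Map<K, V>> partitions = new ArrayList<>();

    TinyDataGrid(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ConcurrentHashMap<>());
        }
    }

    /** Chooses the owning partition; floorMod keeps the index non-negative. */
    int partitionOf(K key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    void put(K key, V value) {
        partitions.get(partitionOf(key)).put(key, value);
    }

    V get(K key) {
        return partitions.get(partitionOf(key)).get(key);
    }

    /** Total number of entries across all partitions. */
    int size() {
        return partitions.stream().mapToInt(Map::size).sum();
    }
}
```

Because every read and write touches only one in-memory partition, lookups stay fast as partitions are added, which is the property behind the "(Near) Real-Time" label.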
  24. Real-Life Project: Sensor data; the number of devices currently doubles every year; data flow ~200 GB/month; target data flow ~500 GB/month
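The figures on slide 24 allow a rough estimate of when the target is reached. The sketch assumes that data flow grows in proportion to the number of devices, which the slide does not state; under that assumption, flow after t years is 200 * 2^t GB/month, and solving 200 * 2^t = 500 gives t = log2(2.5).

```java
final class DataFlowProjection {

    /**
     * Years until the flow grows from current to target, assuming it doubles
     * every year (an assumption: flow proportional to device count).
     */
    static double yearsToReach(double currentGbPerMonth, double targetGbPerMonth) {
        return Math.log(targetGbPerMonth / currentGbPerMonth) / Math.log(2.0);
    }

    public static void main(String[] args) {
        double years = yearsToReach(200.0, 500.0);
        System.out.println("Years to target: " + years);
        System.out.println("Months to target: " + years * 12.0);
    }
}
```

With the slide's numbers this comes to roughly 1.3 years, or about 16 months.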
  25. Real-Life Project Requirements
       When do I need it? In a minute.
       What do I need to do with it? Structured analytics.
       How will I query and search? Ad hoc SQL.
       How do I need to store it? I must and I am required to.
       Where is it coming from? XML.
  26. Real-Life Project: Time-series data; RESTful API; Extendable analytics; Scalability; Speed to Market
  27. Real-Life Project
  28. Real-Life Project (architecture diagram). Availability zones: A, B, C. Components: Devices, Collector, Raw message store, Processor, Post-Processor, Recent data store, Permanent data store, Client API, Analytics API, Analytic Executor Pool, UI, Clients, 3rd Party Services.
  29. Real-Life Project: Vertica stores time-series data only; append-only data store; store organizational data separately; use Vertica’s ExternalFilter for data load; R analytics as UDFs on Vertica; scale Vertica cluster accordingly
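The "append-only" and "time-series data only" decisions on slide 29 can be illustrated with a tiny in-memory model. This is not how Vertica stores data; `AppendOnlySeries` and `Point` are hypothetical names, and the sketch only shows the access pattern: points are added, never updated or deleted, and reads are time-range scans.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AppendOnlySeries {

    /** One immutable measurement. */
    static final class Point {
        final long timestampMillis;
        final double value;

        Point(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final List<Point> points = new ArrayList<>();

    /** Appends a point; there is deliberately no update or delete method. */
    void append(long timestampMillis, double value) {
        points.add(new Point(timestampMillis, value));
    }

    /** Returns points with from <= timestamp < to, in arrival order. */
    List<Point> range(long fromMillis, long toMillis) {
        List<Point> result = new ArrayList<>();
        for (Point point : points) {
            if (point.timestampMillis >= fromMillis && point.timestampMillis < toMillis) {
                result.add(point);
            }
        }
        return Collections.unmodifiableList(result);
    }

    int size() {
        return points.size();
    }
}
```

Keeping mutable organizational data out of this store, as the slide suggests, is what lets the time-series side stay append-only.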
  30. Real-Life Project: Choose the right tool for the job; late changes are expensive. You can do everything yourself. Should you?
  31. Q&A
