Slide 2: Collecting, Storing and Analyzing Big Data

My slides for a Big Data class

Published in: Data & Analytics

  1. Collecting, Storing and Analyzing Big Data (Big Data Process Development). trieunt@fpt.com.vn, tantrieuf31@gmail.com
  2. Agenda: Collecting → Storing → Processing → Analyzing → Learning → Reacting
     Data engineering process (3 tasks):
       1. Collecting: a. Concepts; b. Technology
       2. Storing: a. Big Data Storage Concepts; b. Big Data Storage Technology
       3. Processing: a. Big Data Processing Concepts; b. Big Data Processing Technology
     Data Science/Machine Learning process (3 tasks):
       4. Analyzing → 5. Learning → 6. Reacting
  3. Big Data Analytics Lifecycle: Collecting, Storing, Processing
  4. (Collecting) → Storing → Processing → Analyzing → Learning → Reacting
  5. Collecting
  6. Collecting tools. Batch collecting: Apache Sqoop (from a DBMS to Apache Hadoop). Real-time collecting: RFX-tracking (from stream data to Apache Kafka).
  7. Collecting → (Storing) → Processing → Analyzing → Learning → Reacting
  8. Storing concepts: Clusters; File Systems and Distributed File Systems; NoSQL; Sharding; Replication; Sharding and Replication; CAP Theorem; ACID; BASE
  9. Clusters
  10. NoSQL
  11. Sharding
  12. Replication (Master-Slave)
  13. Replication (Peer-to-Peer)
  14. CAP Theorem
  15. Collecting → Storing → (Processing) → Analyzing → Learning → Reacting
  16. Processing concepts: Parallel Data Processing; Distributed Data Processing; Hadoop; Processing Workloads; Cluster; Processing in Batch Mode; Processing in Realtime Mode
  17. Parallel Data Processing
  18. Distributed Data Processing
  19. Hadoop: a versatile framework that provides both processing and storage capabilities.
  20. Batch processing (offline processing)
  21. Transactional processing
  22. Cluster
  23. Map and Reduce Tasks
  24. Processing in Realtime Mode
  25. Tools
  26. When a standard relational database (Oracle, MySQL, ...) is not good enough. Example: the “analytic system” MySQL database from a startup, tracking all actions in mobile games (iOS, Android, ...).
  27. 3 common problems in Big Data systems:
       1. Size: the volume of the datasets is a critical factor.
       2. Complexity: the structure, behaviour and permutations of the datasets are a critical factor.
       3. Technologies: the tools and techniques used to process a sizable or complex dataset are a critical factor.
  28. What is Apache Phoenix? Apache Phoenix is a SQL skin over HBase, so scaling Phoenix means scaling HBase up or out.
  29. Phoenix SQL Engine
  30. Interesting features of Apache Phoenix
       ● Embedded JDBC driver implements the majority of java.sql interfaces, including the metadata APIs.
       ● Allows columns to be modeled as a multi-part row key or key/value cells.
       ● Full query support with predicate push down and optimal scan key formation.
       ● DDL support: CREATE TABLE, DROP TABLE, and ALTER TABLE for adding/removing columns.
       ● Versioned schema repository. Snapshot queries use the schema that was in place when data was written.
       ● DML support: UPSERT VALUES for row-by-row insertion, UPSERT SELECT for mass data transfer between the same or different tables, and DELETE for deleting rows.
       ● Limited transaction support through client-side batching.
       ● Single table only: no joins yet, and secondary indexes are a work in progress.
       ● Follows ANSI SQL standards whenever possible.
       ● Requires HBase v0.94.2 or above.
       ● 100% Java.
  31. The Phoenix table schema
  32. Setting up the Phoenix JDBC driver
  33. Phoenix and SQL tool in Eclipse 4
  34. Performance: Phoenix vs Hive (running over HDFS and HBase). http://phoenix.apache.org/performance.html
  35. Readings
       1. https://medium.baqend.com/real-time-stream-processors-a-survey-and-decision-guidance-6d248f692056#.s00ac9xtu
       2. https://medium.baqend.com/nosql-databases-a-survey-and-decision-guidance-ea7823a822d#.pn63unwx6
       3. https://www.infoq.com/articles/apache-kafka
       4. https://docs.google.com/document/d/1ZtEhLw3lrQSeNWVEJkKLLy8B8t9zA0MGRqmCOV3_hsA/edit?usp=sharing
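
The Sharding and Replication slides (11 to 13) carry only titles in this transcript, so here is a minimal sketch of how the two ideas can combine. It is an illustration under stated assumptions, not the method used in the deck: the function names `shard_for` and `replicas_for`, the MD5-based hash, and the "next node in the ring" replica placement are all hypothetical choices made for the example.

```python
import hashlib
from typing import List


def shard_for(key: str, num_nodes: int) -> int:
    """Sharding: map a record key to one node index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def replicas_for(key: str, num_nodes: int, replication_factor: int) -> List[int]:
    """Replication: the primary node plus the following nodes, wrapping around.

    Assumption for this sketch: replicas are simply the next nodes in index order.
    """
    if not 1 <= replication_factor <= num_nodes:
        raise ValueError("replication_factor must be between 1 and num_nodes")
    primary = shard_for(key, num_nodes)
    return [(primary + i) % num_nodes for i in range(replication_factor)]


if __name__ == "__main__":
    # Hypothetical keys, only to show the placement of each record.
    for user_id in ["user-1", "user-2", "user-3"]:
        print(user_id, replicas_for(user_id, num_nodes=4, replication_factor=2))
```

Each key always lands on the same primary node, which is what lets reads find the data again, and the extra copies are what keep the record available when one node fails.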

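The Map and Reduce Tasks slide (23) is also title-only here, so the following is a small single-process sketch of the map, shuffle and reduce steps using word counting as the example. It does not use Hadoop; the function names `map_task`, `shuffle`, `reduce_task` and `word_count` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_task(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would do."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_task(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_task(line)]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data process"]))
```

In a real cluster the map tasks would run in parallel on different input splits and the reduce tasks on different key ranges; the sketch keeps everything in one process so the data flow between phases stays visible.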