
TRHUG 2015 - Veloxity Big Data Migration Use Case



My presentation from TRHUG 2015 about a Big Data migration project based on Spark and Impala technologies.



  1. TRHUG 2015 Veloxity Migration Use Case v1.2
  2. About Me ● Hakan Ilter ○ GittiGidiyor / eBay ■ Software Platform & Research Manager ■ Java, Spring, Microservices ○ devveri.com ■ Big Data Consultant and Blogger ○ Search, Big Data, NoSQL
  3. About Veloxity ● Veloxity ○ Wireless Telecom Company ■ Based in Sunnyvale, California ○ Founded in 2013 ■ by two Turkish entrepreneurs ○ CXM solutions ■ Mobile consumer experience management ○ Powerful SDK ○ “Actionable” Analytics
  4. About Data ● Rapidly Growing ○ Now ■ 75K Devices ■ 30 GB / day ○ Short-term ■ 750K devices ■ 300 GB / day ○ Mid-term ■ 7M devices ■ 3 TB / day
  5. Before Migration ● Legacy System ○ RDBMS-Centric Architecture ■ .NET Codebase ■ MSSQL Server ○ Stored Procedures ■ Hundreds of SPs ■ Thousands of lines of code ○ Works fine (for a while)
  6. Before Migration ● Legacy System Problems ○ RDBMS-Centric Architecture ■ .NET doesn’t fit ■ Can’t scale MSSQL Server ○ Stored Procedures ■ Hard to develop/maintain ■ Stored Procedure Hell! ○ Looking for another solution
  7. The answer is Hadoop ● Hadoop ○ MapReduce ■ Can process large amounts of data ○ Hive ■ SQL over unstructured data ○ Impala ■ Massively parallel processing SQL engine ○ Cloudera CDH 5.x ■ Enterprise-ready Big Data Platform
  8. Veloxity Big Data v1 ● MapReduce + Hive + Impala ○ MapReduce ■ Processes JSON input ■ Creates major tables ■ Parquet columnar format as output ○ Hive ■ Query over raw data ○ Impala ■ Builds aggregation tables ■ Analytics based on these tables
  9. Veloxity Big Data v2 ● Spark + Impala ○ Spark ■ Replaces MapReduce ■ Better Developer Productivity ■ Better Performance ■ Rich APIs for Java, Scala, Python ■ In-memory storage ○ Impala ■ Fastest MPP SQL Engine ■ Better than Hive or Spark SQL
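
As a rough illustration of how Spark takes over the old MapReduce job in v2, the sketch below reads the gzipped JSON input into a DataFrame and writes it back as Parquet for Hive/Impala. The HDFS paths, class name, and use of the Spark 1.4+ DataFrame API are assumptions, not details taken from the deck.

    // Minimal sketch (assumed paths/names): gzipped JSON in, Parquet out, Spark 1.4+ Java API
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class BuildModel {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("VeloxityBuildModel");
            JavaSparkContext sc = new JavaSparkContext(conf);
            SQLContext sqlContext = new SQLContext(sc);

            // Spark reads .gz files transparently and infers the JSON schema
            DataFrame raw = sqlContext.read().json("hdfs:///veloxity/incoming/2015-10-01/*.gz");

            // Persist the model in Parquet so Hive and Impala can query it directly
            raw.write().format("parquet").save("hdfs:///veloxity/stats/2015-10-01");

            sc.stop();
        }
    }
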
  10. Big Data Architecture (diagram): Devices send GZipped JSON data to a Tomcat Web App, which copies it to HDFS on the Hadoop Cluster; the model is built with Spark (registered in the Hive Metastore) and aggregation tables are built with Impala; data moves between the cluster and MSSQL Server via Sqoop (import) and Pig (export); the Analytics App serves the Reporting User over REST, issuing Impala Queries to the cluster and SQL Queries to MSSQL Server.
  11. Veloxity Big Data v2 ● Other Tools ○ Java ■ Spring Framework, Tomcat App Server ○ Bash Script ■ For task executions, flows, etc. ■ Because of Oozie! ○ Sqoop ■ Great (only) for imports ○ Pig ■ Good for data cleaning and exports
  12. Performance Comparison ● Data Process & Query Performance ○ Hardware ■ Amazon EC2 ■ m3.2xlarge ■ 8 Core, 30 GB RAM, Standard disk ■ 1 Name Node, 3 Data Nodes ○ Software ■ Cloudera CDH 5.3.2 ■ Impala 2.1.2 ■ Hive 0.13.1 ■ Spark 1.2.0
  13. Data Process Performance ● Input Data ○ 4 GB Gzip compressed ○ 12 GB uncompressed ○ 859 files ● Task ○ Process JSON files ○ Validate each record ○ Fix problems ○ Build a model ○ Save as Parquet Format
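
The validate/fix step in this task might look like the following in Spark. The column names (deviceId, rxSpeed, rxData, txData) come from the query on slide 15; the filtering and repair rules, class name, and paths are only illustrative assumptions.

    // Hypothetical validate/fix pass before saving Parquet (rules are assumptions, not from the deck)
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class ValidateAndFix {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("VeloxityValidate"));
            SQLContext sqlContext = new SQLContext(sc);
            DataFrame raw = sqlContext.read().json("hdfs:///veloxity/incoming/2015-10-01/*.gz");

            DataFrame valid = raw
                .filter(raw.col("deviceId").isNotNull())             // drop records without a device id
                .filter(raw.col("rxSpeed").geq(0))                    // drop obviously broken measurements
                .na().fill(0.0, new String[] {"rxData", "txData"});   // repair missing counters

            valid.write().format("parquet").save("hdfs:///veloxity/stats/2015-10-01");
            sc.stop();
        }
    }
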
  14. Data Process Performance ● Results
  15. Query Performance ● Input Data ○ 542 MB Snappy compressed ○ 1.6 GB uncompressed ○ 11M rows ○ 468 Parquet files ● Query: SELECT deviceId, COUNT(*), AVG(rxSpeed), MAX(rxSpeed), AVG(txSpeed), MAX(txSpeed), SUM(rxData), SUM(txData) FROM stats GROUP BY deviceId ORDER BY deviceId LIMIT 100
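
For reference, the same benchmark query can be issued programmatically. The sketch below runs it over JDBC; the driver class, host, port 21050, and noSasl setting are assumptions (the Cloudera Impala JDBC driver would work similarly).

    // Sketch: running the benchmark query over JDBC; connection details are assumptions
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class QueryStats {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 driver also talks to Impala
            String url = "jdbc:hive2://impala-host:21050/;auth=noSasl";
            String sql = "SELECT deviceId, COUNT(*), AVG(rxSpeed), MAX(rxSpeed), AVG(txSpeed), MAX(txSpeed), "
                       + "SUM(rxData), SUM(txData) FROM stats GROUP BY deviceId ORDER BY deviceId LIMIT 100";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
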
  16. Query Performance ● Results
  17. Lessons Learned ● CDH updates are critical ○ Always test first! ○ Use VMs for testing ● Install Spark manually ○ The latest Spark version is 1.5.0 ○ CDH 5.4.x still comes with Spark 1.3 ● The small files problem ○ Merge small files often
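
One common way to merge small files is to rewrite a day's data from Spark with fewer output partitions, as in the sketch below; the paths, class name, and target file count are assumptions.

    // Sketch: compacting small Parquet files by rewriting a day's data with fewer partitions (assumed paths)
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class CompactSmallFiles {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("VeloxityCompact"));
            SQLContext sqlContext = new SQLContext(sc);

            DataFrame day = sqlContext.read().parquet("hdfs:///veloxity/stats/2015-10-01");

            // coalesce() limits the number of output files; write to a new location, then swap directories
            day.coalesce(8).write().format("parquet").save("hdfs:///veloxity/stats_compacted/2015-10-01");
            sc.stop();
        }
    }
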
  18. Lessons Learned ● More... ● Partitioning ○ Use partitions wisely ○ Too many partitions = slower queries ● Metadata management ○ Improvement is needed ○ Can’t remove a partition with a query ● Don’t use Google Gson for JSON ○ Extremely slow ○ Use the Boon project instead
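
A coarse partition key keeps the partition count manageable, e.g. partitioning by day rather than by device. The sketch below shows the idea; the "day" column, paths, and class name are assumptions.

    // Sketch: writing partitioned Parquet with a coarse key (assumed column "day") to keep partition counts low
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class WritePartitioned {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("VeloxityPartitioned"));
            SQLContext sqlContext = new SQLContext(sc);

            DataFrame stats = sqlContext.read().parquet("hdfs:///veloxity/stats");

            // One directory per day; partitioning by deviceId instead would create millions of tiny partitions
            stats.write().partitionBy("day").format("parquet").save("hdfs:///veloxity/stats_by_day");
            sc.stop();
        }
    }
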
  19. Veloxity Big Data v3 ● Future Plans ● Vert.x ○ Lightweight, Non-blocking IO ● Apache Kafka ○ Enables streaming data ● Spark Streaming ○ Real-time data processing ● Spark DataFrames ○ No need for other tools (Sqoop, Pig, etc.)
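
To make the Kafka + Spark Streaming plan concrete, a minimal direct-stream sketch could look like the following; the broker list, topic name, batch interval, and class name are assumptions.

    // Sketch: consuming device JSON from Kafka with Spark Streaming (broker/topic names are assumptions)
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class StreamingIngest {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("VeloxityStreaming");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092");
            Set<String> topics = Collections.singleton("device-stats");

            JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

            // Count records per 30-second batch as a placeholder for the real model-building logic
            stream.map(record -> record._2()).count().print();

            jssc.start();
            jssc.awaitTermination();
        }
    }
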
  20. Veloxity Big Data v3 ● More... ● Cloudera Kudu ○ New Storage for Fast Analytics on Fast Data ■ https://github.com/cloudera/kudu ● Project Tungsten ○ Bringing Spark Closer to Bare Metal ■ http://bit.ly/1KPpFBC ● Impala Roadmap ○ Nested Types ○ Performance Improvements
  21. Thanks!
