Big Data Analytics in the Cloud


Published in: Technology
  • Browser-based interface; table explorer; query history; syntax highlighting; test mode execution; expression evaluation; killing unwanted queries

    1. BIG DATA ANALYTICS IN THE CLOUD
       Siva Narayanan, Qubole, @k2_181
    2. WHO THE HELL IS THIS GUY?
       • PhD in large-scale scientific data management
       • Parallel query processing, Greenplum Parallel Database
       • Hadoop and Hive at Qubole
       Slide labels: Niche, scientific simulation apps; Fortune Companies; Small and medium enterprises
    3. SO YOU WANT TO DO SOME BIG DATA ANALYTICS…
       • Want to do targeted marketing campaigns
       • You want to minimize churn (attrition in customer base)
       • Want to build a product recommendation engine
       Use data to improve your business
    4. TYPICAL BIG DATA PROJECT
       • Buy lots of hardware
       • Buy / install software
       • Hire admins who can keep everything running
       • Hire analysts/data scientists to come up with interesting questions
       • Productionize questions into reports
    5. PROBLEM 1
       • Most organizations struggle to achieve > 40% utilization of their cluster (Chen et al, VLDB 2012)
           • Workloads are exploratory and iterative
           • Actionable reports are produced at best a few times a day
       • Since you have to plan 2-3 years ahead, chances are you will overprovision
       Figure label: Provision for peak workload
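The utilization problem on this slide can be made concrete with a little arithmetic. The sketch below compares a cluster sized for its peak hour against capacity that follows demand hour by hour. The demand numbers and function names are made up for illustration; they are not taken from the talk or from Chen et al.

```python
def utilization(hourly_demand, provisioned_nodes):
    """Fraction of provisioned node-hours that are actually used."""
    used = sum(min(d, provisioned_nodes) for d in hourly_demand)
    return used / (provisioned_nodes * len(hourly_demand))


def node_hours_fixed(hourly_demand):
    """Node-hours paid for when the cluster is sized for the peak hour."""
    return max(hourly_demand) * len(hourly_demand)


def node_hours_elastic(hourly_demand):
    """Node-hours paid for when capacity follows demand hour by hour."""
    return sum(hourly_demand)


# Hypothetical demand (nodes needed per hour): a few heavy report runs,
# otherwise light exploratory use.
demand = [2, 2, 2, 20, 20, 2, 2, 2, 4, 4, 20, 2]

print("utilization at peak sizing:", round(utilization(demand, max(demand)), 2))
print("fixed node-hours:", node_hours_fixed(demand))
print("elastic node-hours:", node_hours_elastic(demand))
```

With this made-up profile the peak-sized cluster is busy for only about a third of the node-hours it pays for, which is the pattern the slide describes.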
    6. PROBLEM 2
       Diagram labels: Heterogeneous Data; End Users (Product Mgrs, User Ops etc.); BOTTLENECK; Ops Engineers; Data Scientists
    7. RESULT
       • Big data projects have traditionally been done at companies that
           • Can afford to overprovision
           • Can hire the right talent
    8. LANDSCAPE IS CHANGING
       • Advent of clouds
           • Provision 10-100s of machines in minutes
           • Pay as you go, grow as you please
       • Free / cheap big-data software
           • Hadoop
           • Hive
           • R
           • Sqoop
           • (many more)
    9. PUBLIC CLOUDS ARE GROWING
       [Chart: I/O requests over time]
       More people are doing critical stuff in the cloud!
    10. CLOUD PRIMITIVES
       • Persistent object/file store, e.g. Amazon’s S3
       • Ability to provision cluster with pre-built images
       • Ability to add or remove nodes from the cluster
       • Hosted operational store like MySQL
       • Ways to bid for excess capacity (Amazon’s spot instances)
           • Can get up to 90% discount
    11. ENTER HADOOP
       • Open-source implementation of MapReduce, the model used by Google to index trillions of web pages
       • Allows programmers to write distributed programs using map and reduce abstractions
       • Primarily Java, but supports other languages too
       • Ability to run these programs on large amounts of data
       • Uses a bunch of cheap hardware, can tolerate failures
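The map and reduce abstractions mentioned on this slide can be illustrated with the classic word count. This is a single-process sketch in plain Python, not Hadoop code. The function names are chosen here for illustration, and the `shuffle` step stands in for the grouping a framework such as Hadoop would do between the two phases across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: collapse all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


print(word_count(["big data in the cloud", "big data analytics"]))
```

Because each call to `map_phase` looks at one line and each call to `reduce_phase` looks at one key, both phases can be spread over many machines, which is where the scaling comes from.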
    12. HADOOP SCALES!
    13. ENTER HIVE
       • Facebook had a multi-petabyte warehouse
       • Had 80+ engineers writing Hadoop jobs
       • Quickly realized that files are insufficient abstractions
       • Need SQL concepts like tables, schemas, partitions, indices
       • Many, many, many more people know SQL than Hadoop
       • So, implemented SQL on top of Hadoop
       • Made data more accessible
       • Finally, FB open sourced it
    14. HIVE
       • SQL* interface on top of unstructured data
       • Handles variety of open data formats
           • JSON, Text, Binary, Avro, ProtoBuf, Thrift
       • Extreme pluggability
           • Some things aren’t meant to be done in SQL
           • Custom Python, PHP, Ruby, Bash code
       • Production ready
           • Processes 25PB of data in FB
       Hive project started by Qubole founders!
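As an illustration of the pluggability point, Hive's `TRANSFORM` clause can stream rows to an external script as tab-separated lines on standard input and read tab-separated rows back from standard output. The sketch below is one such script in Python. The column layout (`user_id`, `url`), the script name, and the table in the query are hypothetical, not from the talk. A query along the lines of `SELECT TRANSFORM(user_id, url) USING 'python extract_domain.py' AS (user_id, domain) FROM page_views` would invoke it once the file has been registered with `ADD FILE`.

```python
import sys
from urllib.parse import urlparse


def transform(lines):
    """Turn 'user_id<TAB>url' rows into 'user_id<TAB>domain' rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows rather than failing the whole job
        user_id, url = fields
        domain = urlparse(url).netloc or "unknown"
        yield f"{user_id}\t{domain}"


def main():
    """Entry point for streaming use: read rows from stdin, write to stdout."""
    for row in transform(sys.stdin):
        print(row)


# Small in-memory demo. As a streaming script the file would call main() here.
sample = ["42\thttp://example.com/products/1\n", "43\tnot a url\n", "broken row\n"]
for row in transform(sample):
    print(row)
```

URL parsing is awkward to express in SQL and short in a scripting language, which is the kind of work the slide has in mind.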
    15. HIVE SCALES!
    16. RECAP: LANDSCAPE IS CHANGING
       • Advent of clouds
       • Free / cheap big-data software
    17. THE BIG OPPORTUNITY
       • Hadoop++ is great for analytics, but designed for data centers
       • Cloud offers very different tradeoffs and opportunities
       Big Data Analytics in the Cloud!
    18. ENTER QUBOLE
       Diagram labels: Spreadsheets*; BI tools; Custom Apps; Browser
       Other players:
       • Amazon’s EMR
       • Treasure Data
       • Mortar Data
    19. QUBOLE FEATURES
       • Simple query interface
       • Automated cluster management
       • Cloud performance enhancements
       • Integration with data sources / sinks
       • Workflows
       • Scheduler
       • Programmability
    21. CLUSTER MANAGEMENT
       • Automatic launching, shutting down clusters at hour boundaries
       • Recycle bad clusters (it happens, sometimes)
       • Save logs for debugging
       • Spot instances to save costs
       • Sophisticated auto-scaling algorithm adjusts to usage
       Actual user quote: “I've basically not had to learn *anything* to get my data feed working”
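Shutting down at hour boundaries makes sense when instances are billed by the started hour: an idle node has already been paid for until the end of its current hour, so releasing it early saves nothing. The sketch below shows that idea only. It is not Qubole's auto-scaling algorithm, which the slides do not describe. The hourly-billing assumption, the function names, and the 5-minute window are all assumptions made for illustration.

```python
SECONDS_PER_HOUR = 3600


def minutes_into_billing_hour(launch_ts, now_ts):
    """Minutes elapsed in the node's current billing hour (hourly billing assumed)."""
    return ((now_ts - launch_ts) % SECONDS_PER_HOUR) / 60.0


def should_terminate(launch_ts, now_ts, is_idle, window_min=5):
    """Release an idle node only in the last few minutes of a paid hour.

    Earlier in the hour the node is already paid for, so it is kept around
    in case new work arrives.
    """
    if not is_idle:
        return False
    return minutes_into_billing_hour(launch_ts, now_ts) >= 60 - window_min


# Hypothetical node launched at t=0 (timestamps in seconds).
print(should_terminate(0, 30 * 60, is_idle=True))   # mid-hour
print(should_terminate(0, 57 * 60, is_idle=True))   # near the boundary
print(should_terminate(0, 57 * 60, is_idle=False))  # busy node
```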
    22. PERFORMANCE
       Cloud optimized: 5x faster than Amazon’s Elastic MapReduce
    23. INTEGRATION
       • ODBC Driver
           • Tableau
           • Excel
       • Database connectors
           • MySQL
           • Vertica
           • MongoDB
       • Other Sources
           • Google Analytics
           • Omniture *
           • AppNexus
    24. WORKFLOWS AND SCHEDULER
       • Example workflow:
           • Extract data from operational MySQL DB about customer transactions
           • Extract FB data on your company or product page
           • Run report that joins FB data with DB data to see how many people who had failed transactions have commented on the FB page
           • Push results to reporting DB so that customer support can access them on an internal site
       • Scheduler allows you to run this workflow every night
       • Dealing with late arrival data
       • Notifications
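The join step of the example workflow can be sketched in a few lines. This is a toy, in-memory version: the record fields (`customer_email`, `status`, `author_email`) and the idea of matching on email are assumptions made for illustration, since the slides do not say how FB users are matched to customers. In the real workflow the two inputs would come from the extract steps and the join would run as a Hive query.

```python
def failed_customers(transactions):
    """Customers with at least one failed transaction."""
    return {t["customer_email"] for t in transactions if t["status"] == "failed"}


def commenters_with_failures(transactions, comments):
    """People who commented on the page and also had a failed transaction."""
    failed = failed_customers(transactions)
    return sorted({c["author_email"] for c in comments if c["author_email"] in failed})


# Hypothetical extracts from the operational DB and from the FB page.
transactions = [
    {"customer_email": "ann@example.com", "status": "failed"},
    {"customer_email": "bob@example.com", "status": "ok"},
    {"customer_email": "cat@example.com", "status": "failed"},
]
comments = [
    {"author_email": "ann@example.com", "text": "My payment did not go through"},
    {"author_email": "bob@example.com", "text": "Great product"},
]

report = commenters_with_failures(transactions, comments)
print(len(report), report)
```

The resulting list is what the last workflow step would push to the reporting DB for customer support.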
    25. PROGRAMMABILITY: REST API
       Python SDK to talk to Qubole
    26. USE CASE
       • Current customer
           • Most popular Q&A site
       • Use cases:
           • A/B testing on new product features and the resulting analysis
           • Path analysis on application usage
           • Operational metrics
       Within one month, went from 4 to 16 users!
    27. ABOUT QUBOLE
       • Ashish Thusoo, CEO/Cofounder
       • Joydeep Sen Sarma, CTO/Cofounder
       • Sadiq Shaik, Director Prod Mgmt
       • Shrikanth Shankar, Head of Engineering
       Processed more than 2 Petabytes in August!
    28. CONCLUSION
       • Big Data Analytics in the Cloud done right
       • Provision 2-node clusters or 500-node clusters with same ease
       • Pay as you go, grow as you please
       • Integrate variety of data sources
       • Optimized for the cloud
       • Reduces business risk and time to insight
    29. THANK YOU! QUESTIONS?
       Go to to sign up for a free trial! We are hiring! @k2_181
    30. PERFORMANCE
       • Columnar cache: 3x speedup
       • Prefetch files to hide latency: 30% improvement
       • Optimize split computation: 8x improvement
       • Multi-part upload of large files
       • Moving files is expensive, write output directly
       • Qubole Hive server: 8x speedup for DDL statements
       • Order-by-limit query optimization: 5x improvement
       Cloud optimized: 5x faster than Amazon’s Elastic MapReduce
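One common way to speed up `ORDER BY ... LIMIT k` queries is to have every split keep only its own best k rows, so that the final sort sees at most k rows per split instead of the whole data set. The sketch below shows that general technique in plain Python. The slides do not say how Qubole implements its optimization, so this should not be read as a description of it. The function names are illustrative.

```python
import heapq


def local_top_k(rows, k, key):
    """Per-split step: keep only the k smallest rows of one split."""
    return heapq.nsmallest(k, rows, key=key)


def order_by_limit(splits, k, key=lambda row: row):
    """Final step: merge the per-split candidates and keep the global top k."""
    candidates = [row for split in splits for row in local_top_k(split, k, key)]
    return heapq.nsmallest(k, candidates, key=key)


# Three hypothetical splits of a larger table.
splits = [[5, 1, 9], [3, 7], [2, 8, 4]]
print(order_by_limit(splits, k=3))
```

The result matches a full sort followed by a limit, because any row in the global top k must also be in the top k of its own split.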