Big Data Analytics in the Cloud

Published in: Technology
  • Slide note: browser-based interface, table explorer, query history, syntax highlighting, test mode execution, expression evaluation, killing unwanted queries

    1. BIG DATA ANALYTICS IN THE CLOUD
        Siva Narayanan, Qubole
        snarayanan@qubole.com, @k2_181
    2. WHO THE HELL IS THIS GUY?
        • PhD in large-scale scientific data management
        • Parallel query processing, Greenplum Parallel Database
        • Hadoop and Hive at Qubole
        [Side labels: Niche, scientific simulation apps; Fortune companies; Small and medium enterprises]
    3. SO YOU WANT TO DO SOME BIG DATA ANALYTICS…
        • You want to run targeted marketing campaigns
        • You want to minimize churn (attrition in the customer base)
        • You want to build a product recommendation engine
        Use data to improve your business
    4. TYPICAL BIG DATA PROJECT
        • Buy lots of hardware
        • Buy / install software
        • Hire admins who can keep everything running
        • Hire analysts / data scientists to come up with interesting questions
        • Productionize those questions into reports
    5. PROBLEM 1
        • Most organizations struggle to achieve more than 40% utilization of their cluster
        • Workloads are exploratory and iterative
        • Actionable reports are produced a few times a day at best
        • Since you have to plan 2-3 years ahead, chances are you will overprovision
        [Figure label: Provision for peak workload. Source: Chen et al., VLDB 2012]
    6. PROBLEM 2
        [Diagram labels: Heterogeneous Data; End Users (Product Mgrs, User Ops, etc.); BOTTLENECK: Ops Engineers, Data Scientists]
    7. RESULT
        • Big Data projects have traditionally been done at companies that:
            • Can afford to overprovision
            • Can hire the right talent
    8. LANDSCAPE IS CHANGING
        • Advent of clouds
            • Provision tens to hundreds of machines in minutes
            • Pay as you go, grow as you please
        • Free / cheap big-data software
            • Hadoop
            • Hive
            • R
            • Sqoop
            • (many more)
    9. PUBLIC CLOUDS ARE GROWING
        [Chart: I/O requests over time]
        More people are doing critical stuff in the cloud!
    10. CLOUD PRIMITIVES
        • Persistent object/file store, e.g. Amazon’s S3
        • Ability to provision a cluster with pre-built images
        • Ability to add or remove nodes from the cluster
        • Hosted operational store like MySQL
        • Ways to bid for excess capacity (Amazon’s spot instances)
            • Can get up to 90% discount
    11. ENTER HADOOP
        • Open-source implementation of MapReduce, the programming model Google used to index trillions of web pages
        • Allows programmers to write distributed programs using map and reduce abstractions
        • Primarily Java, but supports other languages too
        • Ability to run these programs on large amounts of data
        • Runs on a bunch of cheap hardware and can tolerate failures
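To make the map and reduce abstractions concrete, here is a minimal single-process sketch of a word count. It is not Hadoop code: the function names (`map_words`, `reduce_counts`, `run_job`) are made up for illustration, and the in-memory grouping only stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(lines: Iterable[str], mapper, reducer) -> Dict[str, int]:
    """Hypothetical local driver: map every line, group by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # stand-in for the shuffle phase
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_job(["big data in the cloud", "big clusters"], map_words, reduce_counts))
```

The programmer writes only the two small functions; distributing the input, grouping by key, and retrying failed tasks is the part a framework like Hadoop takes over.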
    12. HADOOP SCALES!
    13. ENTER HIVE
        • Facebook had a multi-petabyte warehouse
        • Had 80+ engineers writing Hadoop jobs
        • Quickly realized that files are insufficient abstractions
        • Need SQL concepts like tables, schemas, partitions, indices
        • Many, many, many more people know SQL than Hadoop
        • So, implemented SQL on top of Hadoop
        • Made data more accessible
        • Finally, FB open sourced it
    14. HIVE
        • SQL* interface on top of unstructured data
        • Handles a variety of open data formats
            • JSON, Text, Binary, Avro, ProtoBuf, Thrift
        • Extreme pluggability
            • Some things aren’t meant to be done in SQL
            • Custom Python, PHP, Ruby, Bash code
        • Production ready
            • Processes 25 PB of data at FB
        Hive project started by Qubole founders!
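As an illustration of the "custom Python code" point, the sketch below is a small script of the kind Hive can call through its `TRANSFORM ... USING` clause, which streams rows to the script as tab-separated lines on stdin and reads tab-separated lines back from stdout. The script name, table, and columns (`extract_domain.py`, `page_views`, `user_id`, `url`) are hypothetical.

```python
import sys
from typing import Iterable, Iterator
from urllib.parse import urlparse

# Hypothetical HiveQL that would invoke this script (names are assumptions):
#   ADD FILE extract_domain.py;
#   SELECT TRANSFORM (user_id, url)
#   USING 'python extract_domain.py'
#   AS (user_id, domain)
#   FROM page_views;


def transform(lines: Iterable[str]) -> Iterator[str]:
    """Turn each 'user_id<TAB>url' row into 'user_id<TAB>domain'."""
    for line in lines:
        user_id, url = line.rstrip("\n").split("\t")
        yield f"{user_id}\t{urlparse(url).netloc}"


# Only read stdin when rows are piped in, as they are when Hive runs the script.
if __name__ == "__main__" and not sys.stdin.isatty():
    for row in transform(sys.stdin):
        print(row)
```

Extracting a domain from a URL is awkward in plain SQL but a one-liner in Python, which is the sort of task the slide's "some things aren't meant to be done in SQL" refers to.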
    15. HIVE SCALES!
    16. RECAP: LANDSCAPE IS CHANGING
        • Advent of clouds
        • Free / cheap big-data software
    17. THE BIG OPPORTUNITY
        • Hadoop++ is great for analytics, but designed for data centers
        • Cloud offers very different tradeoffs and opportunities
        Big Data Analytics in the Cloud!
    18. ENTER QUBOLE
        [Diagram labels: Spreadsheets*, BI tools, Custom Apps, Browser]
        Other players:
        • Amazon’s EMR
        • Treasure Data
        • Mortar Data
    19. QUBOLE FEATURES
        • Simple query interface
        • Automated cluster management
        • Cloud performance enhancements
        • Integration with data sources / sinks
        • Workflows
        • Scheduler
        • Programmability
    20. QUERY INTERFACE
    21. CLUSTER MANAGEMENT
        • Automatically launches clusters and shuts them down at hour boundaries
        • Recycles bad clusters (it happens, sometimes)
        • Saves logs for debugging
        • Uses spot instances to save costs
        • Sophisticated auto-scaling algorithm adjusts to usage
        Actual user quote: “I've basically not had to learn *anything* to get my data feed working”
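Why shut down at hour boundaries? Assuming instances are billed per started hour, releasing an idle node in the middle of an hour throws away time that is already paid for. The sketch below shows one way such a rule could look. It is an illustration under that billing assumption, not Qubole's algorithm, and `should_terminate` and `window_minutes` are hypothetical names.

```python
from datetime import datetime, timedelta


def minutes_into_billed_hour(launch_time: datetime, now: datetime) -> float:
    """Minutes elapsed in the node's current billed hour, counted from launch."""
    elapsed_seconds = (now - launch_time).total_seconds()
    return (elapsed_seconds % 3600) / 60


def should_terminate(launch_time: datetime, now: datetime,
                     is_idle: bool, window_minutes: int = 5) -> bool:
    """Release an idle node only in the last few minutes of a paid hour."""
    if not is_idle:
        return False
    return minutes_into_billed_hour(launch_time, now) >= 60 - window_minutes


if __name__ == "__main__":
    launch = datetime(2013, 9, 1, 10, 0)
    print(should_terminate(launch, launch + timedelta(minutes=20), True))
    print(should_terminate(launch, launch + timedelta(minutes=57), True))
```

Keeping an idle node until just before the next hour starts costs nothing extra under this billing model and leaves it available if new queries arrive in the meantime.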
    22. PERFORMANCE
        Cloud optimized: 5x faster than Amazon’s Elastic MapReduce
    23. INTEGRATION
        • ODBC Driver
            • Tableau
            • Excel
        • Database connectors
            • MySQL
            • Vertica
            • MongoDB
        • Other Sources
            • Google Analytics
            • Omniture*
            • AppNexus
    24. WORKFLOWS AND SCHEDULER
        • Example workflow:
            • Extract data about customer transactions from the operational MySQL DB
            • Extract FB data on your company or product page
            • Run a report that joins FB data with DB data to see how many people who had failed transactions have commented on the FB page
            • Push results to the reporting DB so that customer support can access them on an internal site
        • Scheduler allows you to run this workflow every night
        • Dealing with late-arriving data
        • Notifications
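The example workflow can be sketched as extract, join, and load steps. The code below is a toy stand-in, not Qubole's workflow API: in-memory SQLite databases play the role of the operational and reporting databases, the FB comments arrive as a plain list, and all table, column, and function names are hypothetical. It also assumes commenters have already been matched to customer IDs.

```python
import sqlite3
from typing import Dict, List, Set


def extract_failed_customers(operational_db: sqlite3.Connection) -> Set[int]:
    """Step 1: pull customers with at least one failed transaction."""
    rows = operational_db.execute(
        "SELECT DISTINCT customer_id FROM transactions WHERE status = 'failed'")
    return {row[0] for row in rows}


def join_with_comments(failed: Set[int], comments: List[Dict]) -> List[int]:
    """Step 3: keep customers who both had a failure and left a comment."""
    return sorted({c["customer_id"] for c in comments if c["customer_id"] in failed})


def push_report(reporting_db: sqlite3.Connection, customer_ids: List[int]) -> None:
    """Step 4: publish the result where customer support can query it."""
    reporting_db.execute(
        "CREATE TABLE IF NOT EXISTS failed_and_commented (customer_id INTEGER PRIMARY KEY)")
    reporting_db.executemany(
        "INSERT OR REPLACE INTO failed_and_commented VALUES (?)",
        [(cid,) for cid in customer_ids])
    reporting_db.commit()


def run_workflow(operational_db, comments, reporting_db) -> List[int]:
    """Run the steps in order; step 2 (fetching comments) is supplied by the caller."""
    failed = extract_failed_customers(operational_db)
    matched = join_with_comments(failed, comments)
    push_report(reporting_db, matched)
    return matched
```

A scheduler would then call something like `run_workflow` once per night, which is the part the slide's scheduler bullet covers.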
    25. PROGRAMMABILITY: REST API
        Python SDK to talk to Qubole
    26. USE CASE
        • Current customer
            • Most popular Q&A site
        • Use cases:
            • A/B testing on new product features and the resulting analysis
            • Path analysis on application usage
            • Operational metrics
        Within one month, went from 4 to 16 users!
    27. ABOUT QUBOLE
        • Ashish Thusoo, CEO/Cofounder
        • Joydeep Sen Sarma, CTO/Cofounder
        • Sadiq Shaik, Director Prod Mgmt
        • Shrikanth Shankar, Head of Engineering
        Processed more than 2 petabytes in August!
    28. CONCLUSION
        • Big Data Analytics in the Cloud done right
        • Provision 2-node clusters or 500-node clusters with the same ease
        • Pay as you go, grow as you please
        • Integrate a variety of data sources
        • Optimized for the cloud
        • Reduces business risk and time to insight
    29. THANK YOU! QUESTIONS?
        Go to http://www.qubole.com to sign up for a free trial!
        We are hiring! jobs@qubole.com
        snarayanan@qubole.com, @k2_181
    30. PERFORMANCE
        • Columnar cache: 3x speedup
        • Prefetch files to hide latency: 30% improvement
        • Optimize split computation: 8x improvement
        • Multi-part upload of large files
        • Moving files is expensive, so write output directly
        • Qubole Hive server: 8x speedup for DDL statements
        • Order-by-limit query optimization: 5x improvement
        Cloud optimized: 5x faster than Amazon’s Elastic MapReduce
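The slide does not say how the order-by-limit optimization works. One common approach for `ORDER BY ... LIMIT k` queries, sketched here as an assumption rather than as Qubole's implementation, is to have each data partition keep only its own k smallest rows and then merge those small candidate lists, instead of sorting the entire dataset. The function names are hypothetical.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

Row = TypeVar("Row")


def local_top_k(rows: Iterable[Row], k: int, key: Callable) -> List[Row]:
    """Per-partition step: keep only the k smallest rows by sort key."""
    return heapq.nsmallest(k, rows, key=key)


def order_by_limit(partitions: Iterable[Iterable[Row]], k: int, key: Callable) -> List[Row]:
    """Final step: merge the per-partition candidates and take the k smallest."""
    candidates = [row for part in partitions for row in local_top_k(part, k, key)]
    return heapq.nsmallest(k, candidates, key=key)


if __name__ == "__main__":
    parts = [[5, 1, 9], [3, 7], [2, 8]]
    print(order_by_limit(parts, 3, key=lambda value: value))
```

With p partitions, the final step handles at most p × k rows no matter how large the input is, which is where the saving over a full sort comes from.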
