Big Data Presentation at SCQAA-SF on June 12 2013


Published in: Technology
  • Presentation: BI/Big Data Futures - Is it really all about the Cloud? In this survey session, SKS will bring you up to date on what's happening in the world of enterprise Business Intelligence. Big Data, NoSQL, Hadoop, Big Analytics, Cloud Storage: what does all of this mean to you as a data professional? Which products and technologies are mature enough for enterprise adoption and which ones are not? Which vendors should you be trying out and why? What is the reality of hosting enterprise data on the cloud? What are the business reasons to explore these new technologies? How do you learn to implement them? SKS frames this talk with the three major trends that she sees in the Enterprise BI space, highlighting products and technologies that warrant a deeper look.
  • From the blog -

    1. 1. Welcome to Our June Meeting, June 13, 2013
    2. 2. About SCQAA-SF - A Not-for-Profit Organization
        • The SCQAA-SF chapter sponsors the sharing of information to promote and encourage the improvement of information technology quality practices and principles through networking, training and professional development.
        • Networking: We meet once every two months in the San Fernando Valley.
        • Check us out on LinkedIn (SCQAA-SF)
        • Contact Sujit at or call 818-878-0834
    3. 3. Membership Benefits:
        • Excellent speaker presentations on advancements in technology and methodology
        • Networking opportunities
        • PDU, CSTE and CSQA credits
        • Regular meetings are free for members and include dinner
    4. 4. Membership Policy
        • We recently revised our membership dues policy to better accommodate member needs and current economic conditions.
        • Annual membership is $50, or $35 for those who are in between jobs.
        • Please check your renewal with Cheryl Leoni. If you have recently joined or renewed, please check before renewing again.
    5. 5. Sunil Sabat
        Data Practitioner, Scientist, Architect
        Insights to Big Data and Quality
        Ref; Jan 2012- for SoCalCodeCamp
    6. 6. Agenda
        • Big Data and modern data management
        • Old BI and New BI
        • Hadoop Frameworks
        • Big Data Quality – Hybrid Approach
        • Big Data Processing - ETL
        • Examples of Hadoop ETL/QA
        • Big Data QA ToDo
        • Q/A
    7. 7. Big Data
        • Today, useful data is 80% unstructured and 20% structured
        • Old-style warehouses are not easy to build, and are very expensive to build and maintain
        • Today, the business need is for real-time, actionable insight
        • Big Data features volume, variety, velocity and veracity
        • Fact - Businesses need actionable intelligence to succeed
    8. 8. Modern Data Management Hub
    9. 9. Obama Election and Big Data
        • “The Obama campaign found a way to integrate social media, technology, email databases, fundraising databases and consumer market data,” said GOP digital strategist Vincent Harris, who did digital work for Newt Gingrich and Rick Perry in 2012. “That does not exist on the Republican side to that degree,” to the detriment of Mitt Romney’s campaign (quoted by Politico, “GOP seeks to up its online game”, December 8, 2012). For more on how the Obama campaign used big data, see BusinessWeek’s November 29, 2012 article “The Science Behind Those Obama Campaign Emails”.
    10. 10. BI = ‘Current State’ Questions
        • What did we sell?
        • When did we sell it?
        • Where did we sell it?
        • What did we sell with it?
        Collecting transactional data
    11. 11. BigData = ‘Next State’ BI Questions
        • What could happen?
        • Why didn’t this happen?
        • When will the next new thing happen?
        • What will the next new thing be?
        • What should happen?
        Collecting behavioral temporal data
    12. 12. Comparing old and new BI data
                        Old BI data               New BI data
        Data Size       Gigabytes (Terabytes)     Petabytes (Exabytes)
        Access          Interactive and Batch     Batch
        Updates         Read / Write many times   Write once, Read many times
        Structure       Static Schema             Dynamic Schema
        Integrity       High (ACID)               Low
        Scaling         Nonlinear                 Linear
        DBA Ratio       1:40                      1:3000
        Reference: Tom White’s Hadoop: The Definitive Guide
    13. 13. Deeper Comparison Chart
    14. 14. Is Data Science your next Career?
    15. 15. R-Language
    16. 16. Hadoop – MapR, Hortonworks, Cloudera, IBM, Apache…
    17. 17. Oracle Loader for Hadoop
    18. 18. SQL Server Connector for Hadoop
    19. 19. Hadoop on Azure
    20. 20. Amazon AWS
    21. 21. Google App Engine Data
    22. 22. Google – MySQL & Cloud Storage
    23. 23. Big Data QA Process
        • Hybrid approach - can use traditional Perl-like scripting, tools, and JUnit tests on the destination side
        • Use Hadoop jobs to refine and do ETL for unstructured data on the source side
        • Improve the upstream QA process to do most of the ETL/QA at the source
        • Leverage Hadoop infrastructure to do mining
        • Fact – the Big Data QA window is getting smaller
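The destination-side half of the hybrid approach can be illustrated with a small script. The following is a minimal sketch, not the presenter's tooling: it assumes source and destination rows are available in memory as tuples, and the names `row_fingerprint` and `reconcile` are hypothetical helpers introduced only for this example.

```python
import hashlib


def row_fingerprint(row):
    """Return a stable hash of a row's values (column order matters)."""
    joined = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def reconcile(source_rows, dest_rows, required_columns=()):
    """Compare source and destination rows and report simple QA findings.

    required_columns holds column indexes that must not be empty
    in the destination (an assumption made for this sketch).
    """
    source_prints = {row_fingerprint(r) for r in source_rows}
    dest_prints = {row_fingerprint(r) for r in dest_rows}
    null_violations = [
        i for i, r in enumerate(dest_rows)
        if any(r[c] is None or r[c] == "" for c in required_columns)
    ]
    return {
        "source_count": len(source_rows),
        "dest_count": len(dest_rows),
        "missing_in_dest": len(source_prints - dest_prints),
        "unexpected_in_dest": len(dest_prints - source_prints),
        "null_violations": null_violations,
    }


if __name__ == "__main__":
    source = [(1, "a"), (2, "b"), (3, "c")]
    dest = [(1, "a"), (2, "b"), (4, "")]
    print(reconcile(source, dest, required_columns=(1,)))
```

Checks like these are cheap to run after every load, which matters when the QA window is short; on real volumes the same comparison would more likely run as a job on the cluster than in a single process.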
    24. 24. Microsoft SSIS - Hadoop ETL
        • Use ODBC driver to extract data from any Hadoop HDFS
        • Use HDInsight (Microsoft Hadoop) as data store
        • Use SSIS for ETL
        • Source lookups from Melissa Data and others
        • Load to SQL Server
        Reference URL :
    25. 25. Amazon EMR - Hadoop ETL
        • Design and code a job on Amazon AWS using EMR (Elastic MapReduce)
        • Source lookups from Melissa Data and others
        • Run the job to do ETL
        • Read and write to S3 buckets
        • Use open source Pig Latin and Java UDFs for ETL
        Reference URL :
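The shape of such an ETL job can be sketched without a cluster. The example below is plain Python, not Pig Latin and not the EMR API: it only imitates the map, group and reduce phases in memory. The three-field input layout (region, product, amount) and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_record(line):
    """Map step: parse one raw line into (key, value) pairs.

    Malformed lines yield no pairs, so they are dropped from the output.
    """
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return []
    region, _product, amount = parts
    try:
        return [(region.upper(), float(amount))]
    except ValueError:
        return []


def reduce_values(key, values):
    """Reduce step: total all amounts seen for one key."""
    return key, round(sum(values), 2)


def run_job(lines):
    """Run map, group-by-key, and reduce over an in-memory list of lines."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            grouped[key].append(value)
    return dict(reduce_values(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    raw = [
        "west, widget, 10.5",
        "West, gadget, 4.5",
        "east, widget, 3",
        "bad line",
        "east, widget, n/a",
    ]
    print(run_job(raw))
```

In a real job the input lines would come from S3 and the grouping would be done by the framework between the map and reduce phases; only the per-record and per-key logic would be written by hand.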
    26. 26. Google – Freebase & Refine
    27. 27. Karmasphere Studio for Amazon Elastic MapReduce
    28. 28. Hadoop Connector to Excel
    29. 29. BI > BigData QA ‘To Do’ List
        Get trained and store some (more) data on the cloud
        • Relational and non-relational
        Process some data in the cloud
        • Do ETL, QA
        • Try data mining
        • Learn about Data Science
        Update your client tools
        • New UI (touch, gestures)
        • Click to Query
        • New form factors (phone, tablet)
    30. 30. Keep Up With Big Data QA
        • Learn Big Data now (NRIT is a bootcamp training provider), learn to write ETL/QA jobs, and query HDFS using ODBC
        • Assume source data is not clean; do upstream ETL and QA using lookups and reference data sets
        • Fact - Hadoop is now being used by most Fortune 500 companies for fast analytics and insights
        • Fact - Investment in Hadoop ultimately depends on BI/analytics – Obama Election
        • FACT - QA matters; garbage in, garbage out is still TRUE!
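The advice to assume dirty source data and validate it against reference data sets can be sketched as follows. This is an illustrative example only: the reference set here is a hard-coded Python set rather than a commercial lookup service, and `split_by_reference` is a hypothetical helper name.

```python
def split_by_reference(records, field, reference_values):
    """Split records into (clean, rejected) using a reference-set lookup.

    Values are compared after trimming whitespace and upper-casing;
    clean records carry the normalized value, rejected ones are unchanged.
    """
    normalized = {str(v).strip().upper() for v in reference_values}
    clean, rejected = [], []
    for record in records:
        value = record.get(field)
        key = "" if value is None else str(value).strip().upper()
        if key in normalized:
            clean.append({**record, field: key})
        else:
            rejected.append(record)
    return clean, rejected


if __name__ == "__main__":
    rows = [
        {"id": 1, "state": " ca "},
        {"id": 2, "state": "ZZ"},
        {"id": 3, "state": None},
    ]
    good, bad = split_by_reference(rows, "state", {"CA", "NY"})
    print(good)
    print(bad)
```

Keeping the rejected records, instead of silently dropping them, gives the upstream team something concrete to fix at the source.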
    31. 31. Questions?
        Please contact NRIT at or sunil.sabat@gmail.com
        Available on LinkedIn and Twitter (@ssabat)
    32. 32. NRIT Big Data Architecture
    33. 33. NRIT and BIG DATA BI