Big Data Presentation at SCQAA-SF on June 12, 2013
Presentation Transcript
Welcome to Our June Meeting
June 13, 2013
About SCQAA-SF - A Not-for-Profit Organization
• The SCQAA-SF (www.scqaa.net) chapter sponsors the sharing of information to promote and encourage the improvement of information technology quality practices and principles through networking, training and professional development.
• Networking: We meet once every two months in the San Fernando Valley.
• Check us out on LinkedIn (SCQAA-SF)
• Contact Sujit at email@example.com or call 818-878-0834
Membership Benefits:
• Excellent speaker presentations on advancements in technology and methodology
• Networking opportunities
• PDU, CSTE and CSQA credits
• Regular meetings are free for members and include dinner
Membership Policy
• We recently revised our membership dues policy to better accommodate member needs and current economic conditions.
• Annual membership is $50, or $35 for those who are between jobs.
• Please check your renewal status with Cheryl Leoni. If you have recently joined or renewed, please check before renewing again.
Sunil Sabat
Data Practitioner, Scientist, Architect
Insights to Big Data and Quality
Ref: Jan 2012 - for SoCalCodeCamp
Agenda
• Big Data and modern data management
• Old BI and New BI
• Hadoop Frameworks
• Big Data Quality - Hybrid Approach
• Big Data Processing - ETL
• Examples of Hadoop ETL/QA
• Big Data QA ToDo
• Q/A
Big Data
• Today, useful data is 80% unstructured and 20% structured
• Old-style warehouses are not easy to build, and are very expensive to build and maintain
• Today, the business need is for real-time, actionable insight
• Big Data features volume, variety, velocity and veracity
• Fact - Businesses need actionable intelligence to succeed
Modern Data Management Hub
Obama Election and Big Data
• “The Obama campaign found a way to integrate social media, technology, email databases, fundraising databases and consumer market data,” said GOP digital strategist Vincent Harris, who did digital work for Newt Gingrich and Rick Perry in 2012. “That does not exist on the Republican side to that degree,” to the detriment of Mitt Romney’s campaign. Quoted by Politico, “GOP seeks to up its online game,” December 8, 2012.
• For more on how the Obama campaign used big data, see BusinessWeek’s November 29, 2012 article “The Science Behind Those Obama Campaign Emails.”
BI = ‘Current State’ Questions
• What did we sell?
• When did we sell it?
• Where did we sell it?
• What did we sell with it?
Collecting transactional data
Big Data = ‘Next State’ BI Questions
• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What should happen?
Collecting behavioral temporal data
Comparing old and new BI data
             Old BI data                  New BI data
Data Size    Gigabytes (Terabytes)        Petabytes (Exabytes)
Access       Interactive and Batch        Batch
Updates      Read / Write many times      Write once, Read many times
Structure    Static Schema                Dynamic Schema
Integrity    High (ACID)                  Low
Scaling      Nonlinear                    Linear
DBA Ratio    1:40                         1:3000
Reference: Tom White’s Hadoop: The Definitive Guide
Deeper Comparison Chart
Is Data Science your next Career?
Hadoop: MapR, Hortonworks, Cloudera, IBM, Apache…
Oracle Loader for Hadoop
SQL Server Connector for Hadoop
Hadoop on Azure
Google App Engine Data
Google – MySQL & Cloud Storage
Big Data QA Process
• Hybrid approach - can use traditional Perl-like scripting, tools, and JUnit tests on the destination side
• Use Hadoop jobs to refine and do ETL for unstructured data on the source side
• Improve the upstream QA process to do most of the ETL/QA at the source
• Leverage Hadoop infrastructure to do mining
• Fact - The Big Data QA window is getting smaller
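As a rough illustration of source-side refinement, the sketch below shows the kind of line-by-line cleaning a Hadoop Streaming-style mapper might perform. The three-field layout (user_id, timestamp, action), the comma delimiter, and the function names are all assumptions made for this example and do not come from the talk.

```python
EXPECTED_FIELDS = 3  # assumed layout: user_id, timestamp, action


def clean_line(raw):
    """Return a normalized tab-separated record, or None if the line is malformed."""
    parts = [p.strip() for p in raw.rstrip("\n").split(",")]
    # Reject lines with the wrong number of fields or any empty field.
    if len(parts) != EXPECTED_FIELDS or not all(parts):
        return None
    user_id, timestamp, action = parts
    # Assumed rule for this sketch: user ids must be numeric.
    if not user_id.isdigit():
        return None
    return "\t".join([user_id, timestamp, action.lower()])


def clean_records(lines):
    """Yield only the records that pass the source-side checks."""
    for raw in lines:
        record = clean_line(raw)
        if record is not None:
            yield record


# In a real streaming job the lines would come from sys.stdin; a small
# in-memory sample is used here so the sketch runs on its own.
sample = [
    "42, 2013-06-12T10:00:00, LOGIN\n",
    "not a record\n",
    "x7,2013-06-12T10:01:00,click\n",
    "8,,view\n",
]
cleaned = list(clean_records(sample))
for record in cleaned:
    print(record)
```

Keeping the per-line rule in its own function (`clean_line`) means the same check can be unit tested on the destination side, which fits the hybrid approach described above.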
Microsoft SSIS - Hadoop ETL
• Use an ODBC driver to extract data from any Hadoop HDFS
• Use HDInsight (Microsoft Hadoop) as the data store
• Use SSIS for ETL
• Source lookups from Melissa Data and others
• Load to SQL Server
Reference URL: http://sqlmag.com/blog/use-ssis-etl-hadoop
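SSIS packages are built in a visual designer, so the flow above cannot be reproduced directly in code here. As a hedged sketch of only the final "load, then verify" step, the example below loads a few rows into a table and reconciles the row count. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table name, columns, and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical rows as they might arrive from an extract step.
extracted = [
    ("1001", "Alice", "US"),
    ("1002", "Bob", "IN"),
    ("1003", "Carol", "US"),
]


def load_and_reconcile(conn, rows):
    """Load rows into a destination table and compare source vs. destination counts."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id TEXT PRIMARY KEY, name TEXT NOT NULL, country TEXT)"
    )
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO customers (customer_id, name, country) VALUES (?, ?, ?)",
            rows,
        )
    loaded = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    return {"extracted": len(rows), "loaded": loaded, "match": loaded == len(rows)}


# sqlite3 in-memory database stands in for the SQL Server destination.
conn = sqlite3.connect(":memory:")
report = load_and_reconcile(conn, extracted)
print(report)
```

A count comparison is only the simplest destination-side check; checksums or column-level comparisons would be natural next steps, but they are beyond what the slide describes.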
Amazon EMR - Hadoop ETL
• Design and code a job on Amazon AWS using EMR (Elastic MapReduce)
• Source lookups from Melissa Data and others
• Run the job to do ETL
• Read and write to S3 buckets
• Use open source Pig Latin and Java UDFs for ETL
Reference URL: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-etl.html
Google – Freebase & Refine
Karmasphere Studio for Amazon Elastic MapReduce
Hadoop Connector to Excel
BI > Big Data QA ‘To Do’ List
Get trained and store some (more) data on the cloud
• Relational and non-relational
Process some data in the cloud
• Do ETL, QA
• Try data mining
• Learn about Data Science
Update your client tools
• New UI (touch, gestures)
• Click to Query
• New form factors (phone, tablet)
Keep Up With Big Data QA
• Learn Big Data now (NRIT is a bootcamp training provider): learn to write ETL/QA jobs and query HDFS using ODBC
• Assume source data is not clean; do upstream ETL and QA using lookups and reference data sets
• Fact - Hadoop is now being used by most Fortune 500 companies for fast analytics and insights
• Fact - Investment in Hadoop ultimately depends on BI/analytics (for example, the Obama election)
• Fact - QA matters; garbage in, garbage out is still TRUE!
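The advice to assume dirty source data and validate it with lookups can be sketched as follows. This is a minimal example, not a method from the talk: the reference table of country codes, the record fields, and the function name are all assumptions chosen for illustration.

```python
# Hypothetical reference data set: country code -> country name.
REFERENCE_COUNTRIES = {"US": "United States", "IN": "India"}


def check_against_reference(records, reference):
    """Split records into (clean, rejected) by looking each country code up."""
    clean, rejected = [], []
    for rec in records:
        # Normalize before the lookup: source values may be padded, lowercase, or missing.
        code = (rec.get("country") or "").strip().upper()
        name = reference.get(code)
        if name is None:
            rejected.append({**rec, "reason": "unknown country"})
        else:
            clean.append({**rec, "country": code, "country_name": name})
    return clean, rejected


records = [
    {"id": 1, "country": " us "},
    {"id": 2, "country": "ZZ"},
    {"id": 3, "country": None},
]
clean, rejected = check_against_reference(records, REFERENCE_COUNTRIES)
print(len(clean), "clean,", len(rejected), "rejected")
```

Rejected records keep a reason field so they can be reviewed upstream rather than silently dropped.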
Questions?
Please contact NRIT at www.nritinc.com firstname.lastname@example.org
Available on LinkedIn and Twitter (@ssabat)