Transcript

  • 1. Putting Analytics in Big Data Analytics. Matt Casters, Chief of Data Integration, Pentaho Corporation. PLUG – Feb 17th, 2011. © 2010, Pentaho. All Rights Reserved. www.pentaho.com.
  • 2. Big Data • Terabytes and petabytes of data • Sometimes per day. © 2010, Pentaho. All Rights Reserved. www.pentaho.com. US and Worldwide: +1 (866) 660-7555
  • 3. Example Use Cases Today Transactional • Fraud detection • Financial services/stock markets Sub-Transactional • Weblogs • Social/online media • Telecoms events
  • 4. Example Use Cases Today Non-Transactional • Web pages, blogs etc. • Documents • Physical events • Application events • Machine events In most cases structured or semi-structured
  • 5. Data Lake • Single source • Large volume • Not distilled
  • 6. Data Lakes • 0-2 lakes per company • Known and unknown questions • Multiple user communities • $1-10k questions, not $1m ones • Don’t fit in traditional RDBMS with a reasonable cost
  • 7. Data Lake Requirements • Store all the data • Satisfy routine reporting and analysis • Satisfy ad-hoc query / analysis / reporting • Balance performance and cost
  • 8. Traditional BI Data Mart(s) Tape/Trash Data ? ? ? Source ? ? ??
  • 9. What if... Data Mart(s) Ad-Hoc Data Warehouse Data Lake(s) Data Source
  • 10. Big Data Does Not Replace Data Marts • It’s not a database • High latency • Optimized for massive data-crunching • Databases are immature • Databases are NoSQL
  • 11. Big Data Map/Reduce And Sometimes per day Hadoop
  • 12. What is Map/Reduce? • Obligatory Wikipedia quote: “... is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers” • Invented by Google to index “The Internet” • Apache Hadoop is an Open Source implementation of the Map/Reduce algorithm • Scalable & fault-tolerant, not efficient!
  • 13. What Hadoop Really Is • Core components • HDFS – a distributed file system allowing massive storage across a cluster of commodity servers • Map-Reduce • Framework for distributed computation; common use cases include aggregating, sorting, and filtering BIG data sets • Problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster • Related Projects • Hive – a data warehouse infrastructure on top of Hadoop • Implements a SQL-like query language, including a JDBC driver • Allows MapReduce developers to plug in custom mappers and reducers • HBase – the Hadoop database – AH HA! • A variant of NoSQL databases, problematic for traditional BI • Best at storing large amounts of unstructured data
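The point on slide 13 about fragments of work that can be computed or recomputed in isolation can be illustrated with a small plain-Java sketch. It does not use Hadoop at all; the class name `FragmentSketch` and its methods are hypothetical, and the "nodes" are just sequential method calls. It only shows that the merged result does not depend on how the input is cut into fragments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FragmentSketch {

    // Work for one fragment: count rows per key. It depends only on its own
    // input, so it could run, or be re-run after a failure, on any node.
    public static Map<String, Integer> computeFragment(List<String> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            counts.merge(row, 1, Integer::sum);
        }
        return counts;
    }

    // Cut the input into fragments of at most `size` rows.
    public static List<List<String>> split(List<String> rows, int size) {
        List<List<String>> fragments = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += size) {
            fragments.add(rows.subList(i, Math.min(i + size, rows.size())));
        }
        return fragments;
    }

    // Merge the partial results of all fragments into the final answer.
    public static Map<String, Integer> merge(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, count) -> total.merge(key, count, Integer::sum));
        }
        return total;
    }

    public static Map<String, Integer> run(List<String> rows, int fragmentSize) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> fragment : split(rows, fragmentSize)) {
            partials.add(computeFragment(fragment));
        }
        return merge(partials);
    }

    public static void main(String[] args) {
        List<String> rows = List.of("fraud", "weblog", "fraud", "telecom", "weblog", "fraud");
        System.out.println(run(rows, 2)); // prints {fraud=3, telecom=1, weblog=2}
    }
}
```

Calling `run(rows, 1)` and `run(rows, 6)` gives the same map, which is the property that lets a cluster reschedule a failed fragment without touching the others.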
  • 14. No seriously, what is Hadoop? Java software framework that supports data-intensive distributed applications • Apache project • Created by Yahoo, Google’s idea • Distributed filesystem + MapReduce engine • Commodity hardware • Scales out beyond technology and/or economy of RDBMS
  • 15. Hadoop and BI? • Distributed processing • Distributed file system • Commodity hardware • Platform independent (in theory) • Scales out beyond technology and/or economy of an RDBMS In many cases it’s the only viable solution
  • 16. Hadoop and BI? 90% of new Hadoop use cases are transformation of semi/structured data* * of those companies we’ve talked to...
  • 17. Hadoop and BI? “The working conditions within Hadoop are shocking” ETL Developer
  • 18. Hadoop and BI? Instead of this...
  • 19. Hadoop and BI? You have to do this in Java... • public void map(Text key, Text value, OutputCollector output, Reporter reporter) • public void reduce(Text key, Iterator values, OutputCollector output, Reporter reporter)
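To make the two signatures on slide 19 concrete, here is a self-contained sketch of what a mapper and reducer body might look like. It deliberately avoids the real Hadoop classes so that it runs on its own: `Collector` is a hypothetical stand-in for the `OutputCollector` parameter, plain `String` replaces `Text`, the `Reporter` argument is omitted, and the `run` method is a toy in-memory driver, not how Hadoop schedules work. The assumed input is a weblog line of the form "ip path status".

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapperReducerShape {

    // Hypothetical stand-in for the OutputCollector parameter on the slide.
    interface Collector {
        void collect(String key, String value);
    }

    // Same shape as map(key, value, output, reporter), minus the reporter.
    // Emits (status, "1") for every well-formed weblog line.
    static void map(String key, String value, Collector output) {
        String[] fields = value.trim().split("\\s+");
        if (fields.length == 3) {
            output.collect(fields[2], "1");
        }
    }

    // Same shape as reduce(key, values, output, reporter), minus the reporter.
    // Sums the values that were grouped under one key.
    static void reduce(String key, Iterator<String> values, Collector output) {
        long count = 0;
        while (values.hasNext()) {
            count += Long.parseLong(values.next());
        }
        output.collect(key, Long.toString(count));
    }

    // Toy driver: map every line, group by key, then reduce each group.
    static Map<String, String> run(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            map("", line, (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        Map<String, String> result = new TreeMap<>();
        grouped.forEach((k, vs) -> reduce(k, vs.iterator(), result::put));
        return result;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "10.0.0.1 /index.html 200",
                "10.0.0.2 /missing 404",
                "10.0.0.3 /index.html 200");
        System.out.println(run(log)); // prints {200=2, 404=1}
    }
}
```

Even this trivial count needs parsing, grouping and summing written by hand, which is the contrast the surrounding slides draw with a graphical transformation.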
  • 20. People don’t use Hadoop for BI because they want to...
  • 21. ...they do it because they have to...
  • 22. ... and unfortunately it wasn’t designed for most BI requirements
  • 23. Why not add to Hadoop the things it’s missing...
  • 24. ... until it can do what we need it to?
  • 25. If only we had a Java, embeddable, data transformation engine...
  • 26. Pentaho Data Integration Data Marts, Data Warehouse, Analytical Applications Pentaho Data Integration Design Pentaho Data Deploy Hadoop Integration Orchestrate Pentaho Data Integration
  • 27. Visualize Reporting / Dashboards / Analysis Web Tier DM & DW RDBMS Optimize Hive Hadoop Files / HDFS Load Applications & Systems
  • 28. Reporting / Dashboards / Analysis Web Tier DM RDBMS Hive Hadoop HDFS
  • 29. 30000ft View Host Machine pentaho-hadoop-vm Hadoop PDI Client HDFS Hive Tasks and Jobs
  • 30. Inside the VM pentaho-hadoop-vm Hadoop HDFS Hive Job Mapper Reducer
  • 31. Inside a job Job Mapper Reducer * Java Application Java Application Scripting Scripting * Combiner can be used to pre-reduce in memory on the mappers before data is transmitted.
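The combiner footnote on slide 31 can be sketched in plain Java. The class and method names below are hypothetical and nothing here touches Hadoop; the sketch only compares how many records would leave a mapper with and without an in-memory pre-reduce. It assumes the reduce operation is a sum, which can safely be applied in stages.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    // Without a combiner: one (word, 1) record per input word leaves the mapper.
    public static List<Map.Entry<String, Integer>> mapOnly(List<String> words) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : words) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // With a combiner: the pairs are summed per key in memory on the mapper,
    // so at most one record per distinct word is transmitted.
    public static List<Map.Entry<String, Integer>> mapAndCombine(List<String> words) {
        Map<String, Integer> local = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapOnly(words)) {
            local.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(local.entrySet());
    }

    public static void main(String[] args) {
        List<String> words = List.of("get", "get", "post", "get");
        System.out.println(mapOnly(words).size());       // prints 4
        System.out.println(mapAndCombine(words).size()); // prints 2
        System.out.println(mapAndCombine(words));        // prints [get=3, post=1]
    }
}
```

The reducer still receives correct totals because summing partial sums gives the same answer as summing the individual ones; an operation without that property (a median, for example) could not be pre-reduced this way.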
  • 32. Inside a job with PDI Job Mapper Reducer PDI Execution Engine PDI Execution Engine Transformation Transformation Step Step Step Step Step Step
  • 33. Demo
  • 34. The Single Threaded Transformation Engine • Designed to use a single thread • Processes rows per batch because Hadoop delivers rows in batches • Knows when the batch of rows is processed • Is only initialized once and disposed of once • Has reduced overhead for data passing between steps
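The properties listed on slide 34 can be illustrated with a rough sketch. This is not PDI's API: `SingleThreadedEngineSketch`, `init`, `processBatch` and `dispose` are hypothetical names, and a "step" is simplified to a function that turns one row into one row. The sketch shows one initialization, any number of batches pushed through all steps on the calling thread, and one disposal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class SingleThreadedEngineSketch {

    private final List<Function<String, String>> steps;
    private boolean initialized;

    public SingleThreadedEngineSketch(List<Function<String, String>> steps) {
        this.steps = steps;
    }

    // Called once, before the first batch.
    public void init() {
        initialized = true;
    }

    // Pushes one batch through every step on the calling thread. Rows move
    // from step to step by a plain method call, with no buffers or thread
    // hand-offs in between. Returning means the batch is fully processed.
    public List<String> processBatch(List<String> batch) {
        if (!initialized) {
            throw new IllegalStateException("call init() first");
        }
        List<String> out = new ArrayList<>(batch.size());
        for (String row : batch) {
            String current = row;
            for (Function<String, String> step : steps) {
                current = step.apply(current);
            }
            out.add(current);
        }
        return out;
    }

    // Called once, after the last batch.
    public void dispose() {
        initialized = false;
    }

    public static void main(String[] args) {
        List<Function<String, String>> steps = new ArrayList<>();
        steps.add(row -> row.trim());
        steps.add(row -> row.toUpperCase());

        SingleThreadedEngineSketch engine = new SingleThreadedEngineSketch(steps);
        engine.init();
        System.out.println(engine.processBatch(List.of(" a ", "b"))); // prints [A, B]
        System.out.println(engine.processBatch(List.of("c")));        // prints [C]
        engine.dispose();
    }
}
```

In a design where each step runs on its own thread, every row would cross a buffer between threads; the inner loop here replaces that with a direct call, which is one plausible reading of "reduced overhead for data passing between steps".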
  • 35. The Single Threaded Transformation Engine • Is no longer used inside of Hadoop thanks to new developments. “The multi-threaded engine is still faster” they said. • Is being introduced into PDI 4.2.0 (CE) • You will be able to specify a mapping to run single threaded • Allows you to reduce context switching in large to huge transformations (lots of steps)
  • 36. Pentaho for Hadoop Resources Download www.pentaho.com/download/hadoop Pentaho for Hadoop webpage - resources, press, events, partnerships and more: www.pentaho.com/hadoop Big Data Analytics: 5 part video series with James Dixon, Pentaho CTO Or contact me : mcasters at pentaho dot org
  • 37. Thank You. Join the conversation. You can find us on: http://blog.pentaho.com @Pentaho Pentaho Facebook Group Pentaho - Open Source Business Intelligence Group