MetaScale / Kognitio Hadoop Webinar

Transcript

  • 1. Webinar: Make Big Data Easy with the Right Tools and Talent – MetaScale Expertise and Kognitio Analytics Accelerate Hadoop for Organizations Large and Small (October 2012)
  • 2. Today’s webinar
       – 45 minutes of presentation, with 15 minutes of Q&A
       – We will email you a link to the slides
       – Feel free to use the Q&A feature
  • 3. Agenda
       – Opening introduction
       – MetaScale expertise
          Case study – Sears Holdings
       – Kognitio Analytics
          Hadoop acceleration explained
       – Summary
       – Q&A
     Presenters
       – Dr. Phil Shelley, CEO, MetaScale; CTO, Sears Holdings
       – Roger Gaskell, CTO, Kognitio
     Host
       – Michael Hiskey, VP Marketing & Business Development, Kognitio
  • 4. Big Data ≠ Hadoop
       Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.
       – Volume: (not only) size
       – Velocity: speed of input/output
       – Variety: lots of data sources
       – Value: not the SIZE of your data, but what you can DO with it!
  • 5. OK, so you’ve decided to put data in Hadoop... now what?
       Dr. Phil Shelley, CEO – MetaScale; CTO, Sears Holdings
  • 6. Where Did We Start at Sears?
  • 7. Where Did We Start?
       – Issues with meeting production schedules
       – Multiple copies of data, no single point of truth
       – ETL complexity, cost of software and cost to manage
       – Time taken to set up ETL data sources for projects
       – Latency in data, up to weeks in some cases
       – Enterprise data warehouses unable to handle the load
       – Mainframe workload over-consuming capacity
       – IT budgets not growing – BUT data volumes escalating
  • 8. Why Hadoop? (comparison chart: Traditional Databases & Warehouses vs. Hadoop)
  • 9. An Ecosystem
  • 10. Enterprise Integration
       – Data sourcing
          Connecting to legacy source systems
          Loaders and tools (speed considerations)
          Batch or near-real-time
       – Enterprise data model
          Establish a model and enterprise data strategy early
       – Data transformations
          The end of ETL as we know it
       – Data re-use
          Drive re-use of data
          Single point of truth is now a possibility
       – Data consumption and user interaction
          Consume data in-place wherever possible
          Move data only if you have to
          Exporting to legacy systems can be done, but it duplicates data
          Loaders and tools (speed considerations)
          How will your users interact with the data?
  • 11. Rethink Everything
       – The way you capture data
       – The way you store data
       – The structure of your data
       – The way you analyze data
       – The costs of data storage
       – The size of your data
       – What you can analyze
       – The speed of analysis
       – The skills of your team
       – The way users interact with data
  • 12. The Learning from Our Journey
       – Big Data tools are here and ready for the enterprise
       – An enterprise data architecture model is essential
       – Hadoop can handle enterprise workloads
          To reduce strain on legacy platforms
          To reduce cost
          To bring new business opportunities
       – Must be part of an overall data strategy
       – Not to be underestimated
       – The solution must be an ecosystem
          There has to be a simple way to consume the data
  • 13. Hadoop Strengths & Weaknesses
       Strengths:
       – Cost-effective platform
       – Powerful, fast data-processing environment
       – Good at standard reporting
       – Flexibility: programmable, any data type
       – Huge scalability
       Weaknesses:
       – Barriers to entry: lots of engineering and coding
       – High ongoing coding requirements
       – Difficult to access with standard BI/analytical tools
       – Ad hoc complex analytics difficult
       – Too slow for interactive analytics
  • 14. Reference Architecture
  • 15. What is an “In-Memory” Analytical Platform?
       – A DBMS where all of the data of interest, or specific portions of it, has been permanently pre-loaded into random access memory (RAM)
       – Not a large cache
          Data is held in structures that take advantage of the properties of RAM, NOT copies of frequently used disk blocks
          The database’s query optimiser knows at all times exactly which data is in memory (and which is not)
  • 16. In-Memory Analytical Database Management
       Not a large cache:
       – No disk access during query execution
          Temporary tables in RAM
          Result sets in RAM
       – In-memory means in high-speed RAM, NOT slow flash-based SSDs that mimic mechanical disks
       For more information:
       – Gartner: “Who’s Who in In-Memory DBMSs”, Roxanne Edjlali and Donald Feinberg, 10 Sept 2012, www.gartner.com/id=2151315
  • 17. Why In-Memory: RAM is Faster Than Disk (Really!)
       Actually, that is only part of the story:
       – Analytics completely changes the workload characteristics on the database
       – Simple reporting and transactional processing is all about “filtering” the data of interest
       – Analytics is all about complex “crunching” of the data once it is filtered
       – Crunching needs processing power and consumes CPU cycles
       – Storing data on physical disks severely limits the rate at which data can be provided to the CPUs
       – Accessing data directly from RAM allows much more CPU power to be deployed
  • 18. Analytics is About “CRUNCHING” Through Data
       – CPU cycle-intensive and CPU-bound: joins, analytical functions, aggregations, sorts, grouping
       – All to understand what is happening in the data
       – The more complex the analytics, the more pronounced this becomes
       – Analytical platforms are therefore CPU-bound
          Assuming disk I/O speeds are not a bottleneck
          In-memory removes the disk I/O bottleneck
  • 19. For Analytics, the CPU is King
       – The key metric of any analytical platform should be GB of data per CPU core
          It needs to effectively utilize all available cores
          Hyper-threads are NOT the equivalent of cores
       – Interactive/ad hoc analytics: think data-to-core ratios of ≈ 10 GB of data per CPU core
       – Every cycle is precious
          CPU cores need to be used efficiently
          Techniques such as “dynamic machine code generation” help
       – Careful – the performance impact of compression:
          It makes disk-based databases go faster
          It makes in-memory databases go slower
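A rough worked example of the guideline above, using only the slide’s own ≈ 10 GB-per-core figure (so the numbers are illustrative, not a sizing recommendation): an interactive working set of 5 TB implies on the order of 500 CPU cores, far more than a single commodity server provides, which is exactly the scale-out argument of the next slide.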
  • 20. Speed & Scale are the Requirements
       – Memory and CPU on an individual server are NOWHERE near enough for big data
          Moore’s Law: the power of a processor doubles every two years
          Data volumes: double every year!!
       – The only way to keep up is to parallelise, or scale out
          Combine the RAM of many individual servers
          Many CPU cores, spread across many CPUs, housed in many individual computers
       – Data is split across all the CPU cores
       – All database operations need to be parallelised, with no points of serialisation
       – This is true MPP: every CPU core in every server is efficiently involved in every query
  • 21. Hadoop Connectivity
       Kognitio – external tables
       – Data held on disk in other systems can be seen as non-memory-resident tables by Kognitio users
       – Users can select which data they wish to “suck” into memory, using a GUI or scripts
       – Kognitio seamlessly sucks data out of the source system into Kognitio memory
       – All managed via SQL
       Kognitio – Hadoop connectors
       – Two types: HDFS Connector and Filter Agent Connector
       – Designed for high speed: multiple parallel load streams, demonstrable 14 TB+/hour load rates
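As a purely illustrative sketch of that SQL-managed flow (the table, columns, connector name and HDFS path below are invented for this example, and the exact Kognitio DDL may differ from what is shown here):

        -- Hypothetical example only: object names and exact syntax are assumptions.
        -- Expose data held in Hadoop as a non-memory-resident external table.
        create external table weblogs_ext (
            visit_ts    timestamp,
            customer_id bigint,
            page_url    varchar(2000)
        )
        from hdfs_connector                       -- assumed connector name
        target 'hdfs://namenode/data/weblogs/*';  -- assumed HDFS location

        -- Pull the data of interest into RAM so later queries run from memory.
        create view weblogs_recent as
            select * from weblogs_ext
            where visit_ts >= date '2012-09-01';
        create view image weblogs_recent;         -- pin the view in memory (illustrative syntax)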
  • 22. Tight Hadoop Integration
       HDFS Connector
       – The connector defines access to the HDFS file system
       – An external table accesses row-based data in HDFS
       – Dynamic access, or “pin” data into memory
       – The complete HDFS file is loaded into memory
       Filter Agent Connector
       – The connector uploads an agent to the Hadoop nodes
       – The query passes selections and relevant predicates to the agent
       – Data filtering and projection take place locally on each Hadoop node
       – Only the data of interest is loaded into memory, via parallel load streams
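Continuing the same hypothetical example from the previous sketch: with the Filter Agent Connector, the column list and WHERE-clause predicates of an ordinary query are what the agents on the Hadoop nodes evaluate locally, so only the matching rows and columns travel over the parallel load streams into memory:

        -- Hypothetical example: the agent on each Hadoop node applies the column
        -- projection and the date predicate locally; only qualifying rows are
        -- loaded into Kognitio memory, where the aggregation then runs.
        select customer_id, count(*) as visits
        from   weblogs_ext
        where  visit_ts >= date '2012-09-01'
        group by customer_id
        order by visits desc;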
  • 23. Not Only SQL
       Kognitio V8 external scripts
       – Run third-party scripts embedded within SQL: Perl, Python, Java, R, SAS, etc.
       – One-to-many rows in, zero-to-many rows out, one-to-one
       Example: read long comments from a customer enquiry table, let in-line Perl convert the text into an output stream of words (one word per row), then select the top 1000 words by frequency using standard SQL aggregation:

       create interpreter perlinterp
       command '/usr/bin/perl' sends csv receives csv;

       select top 1000 words, count(*)
       from (external script using environment perlinterp
             receives (txt varchar(32000))
             sends (words varchar(100))
             script S'endofperl(
                 while(<>) {
                     chomp();
                     s/[,.!_]//g;
                     foreach $c (split(/ /)) {
                         if($c =~ /^[a-zA-Z]+$/) { print "$c\n" }
                     }
                 }
             )endofperl'
             from (select comments from customer_enquiry)) dt
       group by 1
       order by 2 desc;
  • 24. Hardware Requirements for In-Memory Platforms
       – Hadoop = industry-standard servers
       – Careful to avoid vendor lock-in
       – Off-the-shelf, low-cost servers match neatly with Hadoop
          Intel or AMD CPUs (x86)
          No special components
       – Ethernet network
       – Standard OS
  • 25. Benefits of an In-Memory Analytical Platform
       A seamless in-memory analytical layer on top of your data persistence layer(s):
       – Analytical queries that used to run in hours or minutes now run in minutes or seconds (often sub-second)
       – High query throughput = massively higher concurrency
       – Flexibility
          Enables greater query complexity
          Users freely interact with data
          Use preferred BI tools (relational or OLAP)
       – Reduced complexity
          Administration de-skilled
          Reduced data duplication
  • 26. The Learning from Our Journey
       – Big Data tools are here and ready for the enterprise
       – An enterprise data architecture model is essential
       – Hadoop can handle enterprise workloads
          To reduce strain on legacy platforms
          To reduce cost
          To bring new business opportunities
       – Must be part of an overall data strategy
       – Not to be underestimated
       – The solution must be an ecosystem
          There has to be a simple way to consume the data
  • 27. Connect / Contact
       Connect
       – www.kognitio.com
       – kognitio.com/blog
       – twitter.com/kognitio
       – linkedin.com/companies/kognitio
       – facebook.com/kognitio
       – youtube.com/user/kognitio
       Contact
       – Michael Hiskey, Vice President, Marketing & Business Development – Michael.hiskey@kognitio.com, Phone: +1 (855) KOGNITIO
       – Dr. Phil Shelley, CEO – MetaScale; CTO, Sears Holdings
       Upcoming Web Briefings: kognitio.com/briefings