
UKOUG 2015 - Big Data At Work : Two Customer Case-Studies from the Rittman Mead Big Data Implementation Team

There's a lot of hype and interest in big data, with many Oracle customers and partners looking to use it to extend the capabilities of their data warehouse, collect new types of customer and event data into what are being termed "data lakes", and apply techniques such as machine learning and unstructured data analysis to find new patterns and insights from their data.
In this session, Rittman Mead talks about two real-world Oracle big data implementations: one covering data warehouse extension and ETL offloading, the other based on data science and the Internet of Things. We'll share startup tips and implementation experiences, and walk through the product architecture and delivery approach for each of the examples.

  1. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
     “Big Data At Work” - Two Customer Stories from the Rittman Mead Big Data Team
     Mark Rittman, CTO, Rittman Mead, March 2015
  2. About Me
     • Mark Rittman, Oracle ACE Director, Oracle BI, DW & Big Data
     • 14 Years Experience with Oracle Technology
     • Regular columnist for Oracle Magazine
     • Author of two Oracle Press Oracle BI books
       ‣ Oracle Business Intelligence Developers Guide
       ‣ Oracle Exalytics Revealed
     • Writer for Rittman Mead Blog : http://www.rittmanmead.com/blog
     • Past Editor of Oracle Scene Magazine, BIRT SIG Chair, ODTUG Board Member
     • Co-founder and CTO for Rittman Mead
  3. What This Session Is About…
     • Share our view on how big data and Hadoop are transforming the BI, DW and Core Tech world
     • Go through our jointly-developed Information Management Reference Architecture
     • Use two projects to illustrate these products and technologies in use:
       ‣ Telecoms + Cable TV Company, Amsterdam
         ‣ Internet of Things, Advanced Analytics, PoC > BDA and use of OBIEE, ODI, ORAAH etc
       ‣ Global Bank + Credit Card Business, London & New York
         ‣ DW-Offloading from Teradata, OBIEE on Hadoop
  4. Everyone’s Talking About … Big Data
     • Explosion in volume and variety of data that’s now available
     • New, cheap and open-source technology makes it economic to store + process it
     • New businesses are being built, and existing ones disrupted, by rise of big data
     • IT departments are using Hadoop + other technologies to complement, and replace, proprietary databases, storage etc
  5. Why is Hadoop & Big Data of Interest to Us?
     • Gives us an ability to store more data, at more detail, for longer
     • Provides a cost-effective way to analyse vast amounts of data
     • Hadoop & NoSQL technologies can give us “schema-on-read” capabilities
     • There are vast amounts of innovation in this area we can harness
     • And it’s very complementary to Oracle BI & DW
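The “schema-on-read” idea mentioned in this slide can be illustrated with a few lines of Python. This is only a rough sketch: the raw event lines, the field names and the `read_with_schema` helper are hypothetical and do not come from the talk. Raw records are stored exactly as they arrive, and a schema is applied only at the moment the data is read.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no upfront schema).
RAW_EVENTS = [
    '{"device": "stb-1", "channel": "news", "seconds": "120"}',
    '{"device": "stb-2", "channel": "sport", "seconds": "45", "extra": "ignored"}',
    '{"device": "stb-3", "channel": "news"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: pick the named fields and cast them.

    `schema` maps field name -> (cast function, default). Fields missing
    from a record fall back to the default instead of failing the read.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }

# Two different schemas applied to the same stored data.
usage_schema = {"channel": (str, None), "seconds": (int, 0)}
device_schema = {"device": (str, None)}

usage_rows = list(read_with_schema(RAW_EVENTS, usage_schema))
device_rows = list(read_with_schema(RAW_EVENTS, device_schema))
```

The same stored lines serve both readers, which is the property that lets new questions be asked of old data without reloading it.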
  6. Two Example Projects
     • Telecoms & Cable Television, Amsterdam
       ‣ Internet of Things, Advanced Analytics, PoC > BDA and use of OBIEE, ODI, ORAAH etc
     • Banking & Credit Cards, London & New York
       ‣ DW-Offloading from Teradata
       ‣ Spark-based Machine Learning
  7. Case Study #1 : International Cable TV + Telephony Company
  8. Telecoms + Cable TV Company
     • World’s largest international cable company, with operations in 14 countries and 35,000 employees
     • As of December 31, 2013, it had an annual revenue of >$15 billion. Its cable services pass >40 million homes, with >20 million customers or approx 50 million RGUs (video, internet, and voice subscribers)
  9. Business Problem to be Addressed
     • Customer was looking to improve their products internally and make their content more relevant to customers
       ‣ Better product quality, higher NPS and higher engagement with customers
     • Initial area of focus for PoC was the “review buffer”
       ‣ Frequency and length of usage
       ‣ For which channels + programs
       ‣ Decision regarding on-board storage on new platforms based on the insights
  10. Project Challenges
     • Client wanted to try out Hadoop platform, but had no servers, software, skills, experience
     • Previous implementation partners had taken months and achieved nothing
     • Client was nearing a deadline to make decisions on set-top box purchases
     • Jobs were on the line…
  11. Rittman Mead - Implementation Partner
     • Long-term partnership with client based on Oracle BI + DW work
     • Cloudera Partner focusing on data discovery + analytics solutions
     • Oracle + Cloudera capabilities key for customer move into big data
       ‣ Core Cloudera Enterprise + data science skills
       ‣ Ability to integrate with wider Oracle stack
       ‣ Agile & flexible development approach
       ‣ Fully-dedicated consulting team with Cloudera and Oracle certifications
       ‣ Co-developed Big Data Reference Arch in conjunction with Oracle
  12. Initial Platform for PoC : Cloudera Express + Oracle R
     • Oracle R distribution for data analysis
     • Oracle R Advanced Analytics for Hadoop
       ‣ Create Hadoop MapReduce jobs using R, run on Hadoop for greater scalability
     • Cloudera Express CDH4.5
       ‣ Free CDH version that aligns with CDH Enterprise on Oracle Big Data Appliance
     • 4x virtual servers with 8GB RAM each
     • Analysis work mainly in R on workstations
     • Focus on R, Apache Hive, ORAAH and MapReduce - also constrained by VM RAM
  13. Working with Cloudera Hadoop (CDH) - Observations
     • Very good product stack, enterprise-friendly, big community, can do lots with free edition
     • Cloudera have their favoured Hadoop technologies - Spark, Kafka
     • Also makes use of Cloudera-specific tools - Impala, Cloudera Manager etc
     • But ignores some tools that have value - Apache Tez for example
     • Easy for an Oracle developer to get productive with the CDH stack
     • But beware of some immature technologies / products
       ‣ Hive != Oracle SQL
       ‣ Spark is very much an “alpha” product
       ‣ Limitations in things like LDAP integration, end-to-end security
       ‣ Lots of products in stack = lots of places to go to diagnose issues
  14. Analysis Approach Used
     • “Data Science” approach was used to try and extract meaning and correlation from the two datasets, previously too big to analyse together
     • Analysis performed using R with ORAAH giving ability to source dataframes from Hive
     • Focus on the “discovery” part
       ‣ Lots of iterations, R graph output
       ‣ Governance, productionizing comes later
     • Six-week PoC with team of 3 consultants
       ‣ Technical architect (me)
       ‣ Data scientist
       ‣ Platform engineer
  15. Oracle R Advanced Analytics for Hadoop Key Features
     • Run R functions on Hive Dataframes
     • Write MapReduce functions in R
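For readers unfamiliar with the MapReduce model that ORAAH exposes to R, the shape of a map function and a reduce function can be sketched as follows. The sketch is in plain Python rather than R, runs in a single process, and uses invented data (seconds of viewing per channel); it is not ORAAH's API, only an illustration of the map, shuffle and reduce steps.

```python
from collections import defaultdict

def mapper(record):
    """Map step: emit (key, value) pairs for one input record."""
    channel, seconds = record
    yield channel, seconds

def reducer(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)

def run_mapreduce(records):
    """Single-process stand-in for the framework: map, shuffle, reduce."""
    groups = defaultdict(list)
    for record in records:                      # map phase
        for key, value in mapper(record):
            groups[key].append(value)           # shuffle: group values by key
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical input: (channel, seconds watched) pairs.
usage = run_mapreduce([("news", 120), ("sport", 45), ("news", 30)])
```

On a real cluster the framework distributes the map and reduce calls across nodes; only the two small functions are written by the analyst.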
  16. So How Did the PoC Proceed?
  17. So How Did the PoC Proceed?
  18. So How Did the PoC Proceed?
  19. So How Did the PoC Proceed?
  20. And Next Steps … Industrialising the Solution
  21. The Oracle IM + Big Data Reference Architecture
     [Diagram. Labels: Input Events, Structured Enterprise Data, Other Data, Event Engine, Data Reservoir, Data Factory, Enterprise Information Store, Reporting, Discovery Lab, Execution, Innovation, Discovery Output, Events & Data, Actionable Events, Actionable Information, Actionable Insights]
  22. The PoC Was Effectively our “Discovery Lab”
  23. … Now We Need to Industrialize It
  24. Moving from PoC into Production … What’s Different?
     • We’re still loading and storing into Hadoop and NoSQL, but…
       ‣ There’s governance and change control
       ‣ Data is secured
       ‣ Data loading and pipelines are resilient and “industrialized”
       ‣ We use ETL tools, BI tools and search tools to enable access by end-users
       ‣ We think about design standards, file and directory layouts, metadata etc
     • Build on insights and models created in the Discovery Lab
     • Put them into production so the business can rely on them
  25. Moving Project into Production - Oracle Big Data Appliance
     • Appliance from Oracle combining Oracle Hardware + Software, and Cloudera Enterprise
     • One-click setup of Kerberos, Sentry, CDH etc
     • Simple patch process for entire OS, software + CDH stack
     • Single point of escalation for CDH + Oracle support issues
     • Integrates with Oracle Exadata and Exalytics Engineered Systems
  26. Working with Oracle Big Data Appliance
     • Don’t underestimate the value of “pre-integrated” - massive time-saver for client
       ‣ No need to integrate Big Data Connectors, ODI Agent etc with HDFS, Hive etc
     • Single support route - raise SR with Oracle, they will route to Cloudera if needed
     • Single patch process for whole cluster - OS, CDH etc
     • Full access to Cloudera Enterprise features
     • Otherwise … just another CDH cluster in terms of SSH access etc
     • We like it ;-)
  27. Data Pipelines in Hadoop
     • Process by which data is loaded (“ingested”) into Hadoop & Big Data Platforms
       ‣ Simple batch scripts using shell scripts, Hive SQL, Cron jobs
       ‣ More complex transformations using Pig, Cascading and other M/R-generating jobs
       ‣ More recently, real-time ingestion using Spark Streaming, Kafka etc
     • Have some key characteristics we want to preserve in any migration to a DI tool
       ‣ Ability to still leverage external libraries, UDFs etc, and integrate with Flume, Oozie etc
       ‣ Keep pipeline concept and preserve dataflow / lazy execution approach
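The “dataflow / lazy execution” characteristic can be pictured with Python generators. This is only an analogy, not how Pig or Spark are implemented, and the stage names and sample lines below are hypothetical. Each stage describes a transformation; no data moves until the final result is requested.

```python
def parse(lines):
    """Split each raw comma-separated line into fields (lazily)."""
    for line in lines:
        yield line.strip().split(",")

def keep_valid(rows, width):
    """Drop rows that do not have the expected number of fields."""
    for row in rows:
        if len(row) == width:
            yield row

def to_record(rows):
    """Turn a two-field row into a dict with a typed measure."""
    for channel, seconds in rows:
        yield {"channel": channel, "seconds": int(seconds)}

# Hypothetical raw input, including one malformed line.
raw = ["news,120\n", "broken-line\n", "sport,45\n"]

# Composing the stages does no work yet; nothing is read until iteration.
pipeline = to_record(keep_valid(parse(raw), width=2))

# Execution happens here, one row at a time through all three stages.
records = list(pipeline)
```

Because the stages are declared before they run, an engine (or a DI tool generating the code) is free to reorder or combine them, which is the property the slide suggests preserving.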
  28. The Problem with Scripts and Scripting
     • But scripts, custom code and individual vs. team development don’t scale
     • We figured this out years ago with data warehousing - we can repeat this again with big data!
       ‣ Lack of coding standards
       ‣ High cost of maintaining custom code over time
       ‣ Can reduce ability to quickly respond to model changes
       ‣ No solution for data quality, metadata management
       ‣ Requires specialist staff who are hard to find/replace
  29. Oracle Data Integrator Big Data Options
     • ODI provides an excellent framework for running Hadoop ETL jobs
       ‣ ELT approach pushes transformations down to Hadoop - leveraging power of cluster
     • Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation
       ‣ Whilst still preserving RDBMS push-down
       ‣ Extensible to cover Pig, Spark etc
     • Process orchestration
     • Data quality / error handling
     • Metadata and model-driven
  30. Future Plans for Cloudera + Oracle Stack at Client
     • Move more data processing to Apache Spark
     • Continue rollout of Cloudera Impala as ad-hoc query engine
     • Workload management using YARN (multi-tenancy?)
     • Interest in Kudu as analytics storage layer
     • More real-time/stream processing using Kafka + Spark Streaming
     • Use of Oracle Data Integrator Big Data Option for ETL hardening
     • Further integration with Oracle BI, Database and Big Data stack
  31. Lessons Learned
     • Start small - don’t be afraid to use VMs, free versions of Hadoop distributions
     • Work closely with the client - don’t get too wrapped-up in technology from day 1
     • Use an agile approach and be prepared for new sprints to take you in new directions
     • Have plans for how to productionize what you’ve done - but preserve innovation lab
     • Recognise the value in Hadoop appliances and proper support as you move beyond PoC
     • Expect much more need to help the client’s IT team as they adopt this new technology
  32. Case Study #2 : Banking & Credit Cards Client
  33. Banking & Credit Cards Client
     • Global bank, credit cards and payment business
     • Big user of Teradata hardware and software (and Oracle…)
     • Long-time customer of Rittman Mead
       ‣ OBIEE and Essbase development
       ‣ Endeca Information Discovery
       ‣ Now Hadoop and Big Data
  34. Business Requirement : DW-Offloading & Predictive Analytics
     • Desire to move core DW platforms currently running on Teradata to lower-cost Hadoop
     • All new data loaded into DW will go into Hadoop, querying via OBIEE, Hive/Impala
     • Use as opportunity to integrate new datasets
       ‣ Combine Salesforce data with financials, for example
     • Use machine learning and predictive analytics to gain further insight into the business
     • Other requirements to consider for the system included
       ‣ Data in Hadoop needs to be updatable
       ‣ Require a PoC first to prove technology
       ‣ Hadoop will replace, not complement, Teradata (therefore query federation not possible)
  35. Initial Design Proposal
     • Extract data from salesforce.com into flat files
     • Load files into Apache HBase tables, with Apache Hive table metadata over them
       ‣ Loading into HBase permits full CRUD activity, rather than append-only loading
     • Query using Apache Hive and OBIEE
  36. What is HBase?
     • Based on the Bigtable paper from Google, 2006 (Chang, Dean et al.)
       ‣ “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.”
     • Key Features:
       ‣ Distributed storage across cluster of machines
       ‣ Random, online read and write data access
       ‣ Schemaless data model (“NoSQL”)
       ‣ Self-managed data partitions
     • Why did we choose it for the project?
       ‣ Allows us to do update and delete activity rather than just Hive append-only
       ‣ Very fast for incremental loading
       ‣ Can define Hive tables over HBase ones, allowing OBIEE to then access them
  37. Programmatically Loading HBase Tables using Python
     • Direct extract from salesforce.com into HBase using Python and add-in packages
       ‣ Python packages extend functionality by adding APIs, integration etc
       ‣ Happybase, Beatbox and Pyhs2 packages installed along with Python
     • All free and open-source

         import pyhs2
         import happybase

         # Open the target HBase table and a write batch that flushes every 10,000 puts
         connection = happybase.Connection('bigdatalite')
         flight_delays_hbase_table = connection.table('test1_flight_delays')
         b = flight_delays_hbase_table.batch(batch_size=10000)

         # Read the source rows from Hive via HiveServer2
         with pyhs2.connect(host='bigdatalite', port=10000, authMechanism="PLAIN",
                            user='oracle', password='welcome1', database='default') as conn:
             with conn.cursor() as cur:
                 # Execute query
                 cur.execute("select * from flight_delays_initial_load")
                 # Fetch table results; first column is the row key, the rest are cells
                 for i in cur.fetch():
                     b.put(str(i[0]), {'dims:year': i[1], 'dims:carrier': i[2],
                                       'dims:orig': i[3], 'dims:dest': i[4],
                                       'measures:flights': i[5], 'measures:late': i[6],
                                       'measures:cancelled': i[7], 'measures:distance': i[8]})

         # Flush any puts still held in the batch
         b.send()
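The row-to-cell mapping in the loader above can be isolated and checked without a cluster. The following is a minimal sketch, not the project's code: `to_hbase_put` is a hypothetical helper, the column order is assumed to match the slide's `select *` result, and the sample row values are made up.

```python
# Column order assumed to match the slide's query result:
# key, year, carrier, orig, dest, flights, late, cancelled, distance
DIMS = ["year", "carrier", "orig", "dest"]
MEASURES = ["flights", "late", "cancelled", "distance"]

def to_hbase_put(row):
    """Return (row_key, {"family:qualifier": value}) for one source row.

    Values are converted to strings because HBase stores uninterpreted bytes.
    """
    key, values = str(row[0]), row[1:]
    names = [f"dims:{c}" for c in DIMS] + [f"measures:{c}" for c in MEASURES]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")
    return key, {name: str(value) for name, value in zip(names, values)}

# Hypothetical sample row, for illustration only.
row_key, columns = to_hbase_put((1, 2008, "AA", "SFO", "JFK", 31, 4, 0, 2586))
```

Keeping the mapping in one function makes it straightforward to unit-test before pointing the loader at a real HBase table.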
  38. Hive SerDes & Storage Handlers
     • Plug-in technologies that extend Hive to handle new data formats and semi-structured sources
     • Typically distributed as JAR files, hosted on sites such as GitHub
     • Can be used to parse log files, access data in NoSQL databases, Amazon S3 etc

         CREATE EXTERNAL TABLE apachelog (
           host STRING, identity STRING, user STRING, time STRING, request STRING,
           status STRING, size STRING, referer STRING, agent STRING)
         ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
         WITH SERDEPROPERTIES (
           "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
           "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
         )
         STORED AS TEXTFILE
         LOCATION '/user/root/logs';

         CREATE TABLE tweet_data(
           interactionId string, username string, content string, author_followers int)
         ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
         STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
         WITH SERDEPROPERTIES (
           'mongo.columns.mapping'='{"interactionId":"interactionId",
             "username":"interaction.interaction.author.username",
             "content":"interaction.interaction.content",
             "author_followers":"interaction.twitter.user.followers_count"}'
         )
         TBLPROPERTIES (
           'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets'
         )
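The regular expression used in the RegexSerDe example can be tried outside Hive before it goes into a table definition. The sketch below is an assumption-laden illustration: it writes the pattern with the bracket escapes that a regex engine needs, the sample log line is invented, and `parse_log_line` is a hypothetical helper. Lines that do not match return all-`None` columns, which is how RegexSerDe is understood to treat non-matching rows.

```python
import re

# Same structure as the slide's "input.regex", written as a Python raw string.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# Column names taken from the apachelog table definition.
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]

def parse_log_line(line):
    """Map one log line onto the table's columns; all None if it does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return dict.fromkeys(COLUMNS)
    return dict(zip(COLUMNS, match.groups()))

# Invented sample line in Apache combined log format.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://example.com/start.html" "Mozilla/4.08"')
row = parse_log_line(sample)
```

Each capture group corresponds to one Hive column in order, so a mismatch between the number of groups and the number of columns is worth checking for first.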
  39. Create Hive Table Metadata over HBase Tables
     • Create Hive tables over the HBase ones to provide SQL load/query capabilities
       ‣ Uses the HBaseStorageHandler storage handler for HBase
       ‣ HBase columns mapped to Hive columns using SERDEPROPERTIES

         CREATE EXTERNAL TABLE hbase_flight_delays (
           key string, year string, carrier string, orig string, dest string,
           flights string, late string, cancelled string, distance string
         )
         STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
         WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,dims:year,dims:carrier,dims:orig,dims:dest,measures:flights,measures:late,measures:cancelled,measures:distance")
         TBLPROPERTIES ("hbase.table.name" = "test1_flight_delays");
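Because the `hbase.columns.mapping` value must line up entry-for-entry with the Hive column list, it can help to generate both from one definition. This is a small illustrative sketch; the two helper functions are hypothetical and simply reproduce the mapping used in the table definition above, with the first Hive column bound to the HBase row key.

```python
def hbase_columns_mapping(families):
    """Build the "hbase.columns.mapping" value.

    `families` maps column family -> list of qualifiers, in Hive column order.
    In this sketch the first Hive column is bound to the row key (":key").
    """
    parts = [":key"]
    for family, qualifiers in families.items():
        parts.extend(f"{family}:{q}" for q in qualifiers)
    return ",".join(parts)  # comma-separated, no whitespace between entries

def hive_columns(families):
    """Hive column names matching the mapping, entry for entry."""
    columns = ["key"]
    for qualifiers in families.values():
        columns.extend(qualifiers)
    return columns

families = {
    "dims": ["year", "carrier", "orig", "dest"],
    "measures": ["flights", "late", "cancelled", "distance"],
}
mapping = hbase_columns_mapping(families)
columns = hive_columns(families)
```

Generating the two lists together avoids the easy mistake of adding a Hive column without a matching mapping entry.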
  40. Load and Query HBase using HiveQL
     • Use HiveQL commands INSERT INTO TABLE … SELECT to load (merge) new data
     • Use HiveQL SELECT query to retrieve data from HBase table

         insert into table hbase_flight_delays
         select * from flight_delays_initial_load;
         Total jobs = 1
         ...
         Total MapReduce CPU Time Spent: 11 seconds 870 msec
         OK
         Time taken: 40.301 seconds

         select count(*), min(cast(key as bigint)) as min_key, max(cast(key as bigint)) as max_key
         from hbase_flight_delays;
         Total jobs = 1
         ...
         Total MapReduce CPU Time Spent: 14 seconds 660 msec
         OK
         200000  1  200000
         Time taken: 53.076 seconds, Fetched: 1 row(s)
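The reason an INSERT into an HBase-backed table behaves like a merge is that HBase writes are keyed puts: writing to an existing row key overwrites the named cells instead of adding a duplicate row. The toy model below illustrates this with a plain Python dict. It is a simplification that ignores HBase cell versioning, and the `put` helper and sample values are hypothetical.

```python
def put(table, row_key, columns):
    """Mimic an HBase put: create the row if new, else overwrite the given cells."""
    table.setdefault(row_key, {}).update(columns)

table = {}

# Initial load of two rows.
put(table, "1", {"dims:carrier": "AA", "measures:flights": "31"})
put(table, "2", {"dims:carrier": "BA", "measures:flights": "12"})

# Re-loading key "1" with a changed measure updates the existing row
# rather than appending a second copy of it.
put(table, "1", {"measures:flights": "35"})
```

After the second load the table still holds two rows, which is the updatable behaviour the project needed and that append-only Hive tables on HDFS did not offer.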
  41. Oracle Business Intelligence and Big Data Sources
     • OBIEE 11g from 11.1.1.7 can connect to Hadoop sources
       ‣ OBIEE 11.1.1.7+ supports Hive/Hadoop as a data source, via specific Hive ODBC drivers and Apache Hive Physical Layer database type
       ‣ But practically, it comes with limitations
       ‣ Current 11.1.1.7 version of OBIEE only ships with HiveServer1 ODBC drivers
       ‣ HiveQL is a limited subset of ISO/Oracle SQL
       ‣ … and Hive access is really slow
  42. Querying Hive-on-HBase Tables
     • For small tables, query response time is fast (HBase cell-level random access)
     • But response time increases (beyond regular HDFS-stored Hive tables) when facts are included

         hive> select * from hbase_geog_dest where dest_city = 'San Francisco, CA';
         OK
         JCC   San Francisco, CA: China Basin Heliport          San Francisco, CA   California   12457
         SFO   San Francisco, CA: San Francisco International   San Francisco, CA   California   14771
         Time taken: 0.171 seconds, Fetched: 2 row(s)

         hive> select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from flight_delays f
             > join hbase_geog_origin o on f.orig = o.key
             > and o.origin_state = 'California'
             > group by o.origin_airport_name;
         ...
         OK
         17638    Arcata/Eureka, CA: Arcata
         9146     Bakersfield, CA: Meadows Field
         125433   Burbank, CA: Bob Hope
         ...
         1653     Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field
         Time taken: 51.757 seconds, Fetched: 27 row(s)
  43. How Can This Be Improved On?
     • Gives the ability to supplement Hadoop data with reference data from Oracle, Excel etc
     • But response time is still quite slow
     • What about faster versions of Hive - Cloudera Impala for example?
       ‣ Cloudera’s answer to Hive query response time issues
       ‣ MPP SQL query engine running on Hadoop, bypasses MapReduce for direct data access
       ‣ Mostly in-memory, but spills to disk if required
       ‣ Uses Hive metastore to access Hive table metadata
       ‣ Similar SQL dialect to Hive - not as rich though and no support for Hive SerDes, storage handlers etc
  44. Enabling Hive Tables for Impala
     • Log into Impala Shell, run INVALIDATE METADATA command to refresh Impala table list
     • Run SHOW TABLES Impala SQL command to view tables available
     • Run COUNT(*) on main ACCESS_PER_POST table to see typical response time

         [oracle@bigdatalite ~]$ impala-shell
         Starting Impala Shell without Kerberos authentication
         [bigdatalite.localdomain:21000] > invalidate metadata;
         Query: invalidate metadata
         Fetched 0 row(s) in 2.18s
         [bigdatalite.localdomain:21000] > show tables;
         Query: show tables
         +-----------------------------------+
         | name                              |
         +-----------------------------------+
         | access_per_post                   |
         | access_per_post_cat_author        |
         | …                                 |
         | posts                             |
         +-----------------------------------+
         Fetched 45 row(s) in 0.15s
         [bigdatalite.localdomain:21000] > select count(*) from access_per_post;
         Query: select count(*) from access_per_post
         +----------+
         | count(*) |
         +----------+
         | 343      |
         +----------+
         Fetched 1 row(s) in 2.76s
  45. Setting up an ODBC Connection to Impala
     • Download ODBC drivers for Impala from Cloudera Website
       ‣ Windows, Linux, Mac, AIX
     • Create system DSN as normal, use port 21050
     • Configure authentication
       ‣ For unsecured cluster, use “No Authentication”
       ‣ For secured, use Kerberos, etc
     • Test datasource to check successful connectivity
     • Complete on both Windows workstation, and server hosting BI Server component
  46. Any Way We Can Improve This Further?
     • If available, use Oracle Big Data SQL to query Hive data only, or federated Hive + Oracle
     • Not available for customer project as customer uses Teradata
     • Access Hive data through Big Data SQL SmartScan feature, for Exadata-type response time
     • Use standard Oracle SQL across both Hive and Oracle data
     • Also extends to data in Oracle NoSQL database
  47. Existing OBIEE Reports Now Redirected To Hadoop Cluster
  48. Use Exalytics In-Memory Aggregate Cache if Required
     • If further query acceleration is required, Exalytics In-Memory Cache can be used
     • Enabled through Summary Advisor, caches commonly-used aggregates in in-memory cache
     • Options for TimesTen or Oracle Database 12c In-Memory Option
     • Returns aggregated data “at the speed of thought”
  50. Next Project : Machine Learning & Predictive Modelling
     • Add merchant, financial, transaction data to the Hadoop-based data reservoir
     • Stream-in transactional, salesforce data for real-time alerting and opportunity spotting
     • Add predictive modeling, machine learning algorithms to gain further value
  51. Lessons Learned
     • Run a mile when the client doesn’t have an innovation lab, development cluster etc
       ‣ You can’t deliver a PoC within a reasonable timeframe with Hadoop treated like PROD db
     • But also do your homework regarding packages, approach you’ll use
       ‣ For banking-type clients, they want to inspect and authorise the approach you’ll take
     • Think through the end-to-end data loading and processing, including real-time elements
     • Be prepared for issues around SQL dialects, bugs etc if using Hive, Impala vs. Oracle SQL
  52. Thank You for Attending!
     • Thank you for attending this presentation; more information can be found at http://www.rittmanmead.com
     • Contact us at info@rittmanmead.com or mark.rittman@rittmanmead.com
     • Look out for our book, “Oracle Business Intelligence Developers Guide”, out now!
     • Follow us on Twitter (@rittmanmead) or Facebook (facebook.com/rittmanmead)
  53. “Big Data At Work” - Two Customer Stories from the Rittman Mead Big Data Team
     Mark Rittman, CTO, Rittman Mead, March 2015
