Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase

Big Data Analytics is characterized by analysis of data along three vectors: exploding data volume, proliferating data variety (relational, multi-media), and accelerating data velocity. However, other key vectors, such as the costs and skill sets needed for Big Data Analytics, are often overlooked. In this session, we consider all five vectors by exploring techniques in which traditional but progressive technologies, such as column store DBMSs and Event Stream Processing, are combined with open source frameworks such as Hadoop to exploit the full potential of Big Data Analytics.
Agenda:
- Big Data Analytics in the real world
- Commercial and Open Source techniques
- Bringing together Commercial and Open Source techniques
* Architectures
* Programming APIs (e.g. embedded and federated MapReduce)
- Conclusions

  1. BIG DATA ANALYTICS IN A HETEROGENEOUS WORLD
     Joydeep Das, Director, Analytics Product Management, Sybase Inc., an SAP Company
     February 16, 2012
  2. AGENDA
     - The real world means business
     - Change is afoot: myriad solution trends
     - Building bridges across a heterogeneous world
     - Summary
  3. BIG DATA ANALYTICS: REAL WORLD MEANS BUSINESS
  4. BIG DATA ANALYTICS ISSUES: DEALING WITH VOLUME, VARIETY, VELOCITY, COSTS, SKILLS
     - Volume: managing and harnessing massive data sets
     - Variety: harmonizing silos of structured and unstructured data
     - Velocity: keeping up with unpredictable data and query flows
     - Costs: too expensive to acquire, operate, and expand
     - Skills: lack of adequate skills for popular APIs
  5. BIG DATA ANALYTICS MATURITY: FROM JARGON TO TRANSFORMATIONAL BUSINESS VALUE*
     Today's jargon (column store, Hadoop, NoSQL, in-memory, MPP, big data) matures into business value: operational efficiencies, revenue growth, and new strategies & business models.
     *A McKinsey study, "Big Data: The next frontier for innovation, competition, and productivity" (May 2011), found huge potential for Big Data Analytics, with metrics as impressive as a 60% improvement in retail operating margins, an 8% reduction in US national healthcare expenditures, and $150M in operational-efficiency savings in European economies.
  6. BIG DATA ANALYTICS IN THE REAL WORLD: PREVALENT IN DATA-INTENSIVE VERTICALS AND FUNCTIONAL AREAS
     Verticals: banking, telecom, capital markets, retail, government, healthcare, information providers.
     Functional areas:
     - Marketing analytics: digital channels; track visits and discover the best channel mix across email, social media, and search
     - Sales analytics: deep correlations; predict risks based on deal DNA (emails, meetings) pattern matching
     - Operational analytics: atomic machine data; analyze RFID, weblog, SMS, and sensor data to continuously spot operational inefficiencies
     - Financial analytics: detailed simulations; liquidity and portfolio simulations, stress tests, error margins
  7. CHANGE IS AFOOT: A MYRIAD OF BIG DATA ANALYTICS SOLUTIONS
  8. CAUSAL LINKS: VARIETY, VELOCITY, VOLUME
     Growing data variety (events, transactional, multi-media, eCommerce, and graph data) arrives at ever-higher velocity (microsecond latencies, continuous flows and/or bursts) and drives data volume that routinely reaches petabytes.
  9. GROWING USER COMMUNITIES
     Data scientists, business analysts, developers/programmers, administrators, business users, external consumers, and business processes.
  10. HARDWARE IS SUPERIOR
     - Small server farms: scale out; larger servers with partitions: scale up
     - Spinning disks to SSDs: 1.2x to 2x speedup
     - SSDs to main memory: 4x to 200x speedup
     - Main memory to CPU caches: 2x to 6x speedup
  11. SOFTWARE EXPECTATIONS HAVE CHANGED
     Expectations have shifted from traditional to contemporary along two dimensions: execution characteristics (performance & scalability, intelligence & automation) and results characteristics.
  12. EXECUTION CHARACTERISTICS: PERFORMANCE FOCUS
     - Columnar access
     - MPP: shared nothing, shared everything
     - Algorithms: computations close to the data, e.g. in-database analytics (MapReduce), FPGA filtering
     - In-memory processing
  13. EXECUTION CHARACTERISTICS: SCALABILITY FOCUS
     - Data compression: natural compression (column store databases), compression techniques (row store databases), hybrid columnar compression databases
     - Storage: SAN, DAS, NAS, distributed file systems
     - Pre-processing engines: stream processing engines, data filtering, transformation engines
  14. EXECUTION CHARACTERISTICS: INTELLIGENCE FOCUS
     - Query and load optimization
     - On-demand systems: virtualization and provisioning
     - CPU-cache-conscious computations
  15. EXECUTION CHARACTERISTICS: AUTOMATION FOCUS
     - Data-conscious federation
     - Automatic workload balancer/mixer
     - User-community-focused collaborative services
  16. RESULTS CHARACTERISTICS: ACCURACY/TOLERANCE FOCUS
     Traditional (associated with SQL): complex schemas; multiple applications; schema on write; atomic-level locking; consistency guarantees across system losses; declarative APIs; interactive; encapsulates elements of ACID.
     Contemporary (associated with NoSQL): simple schemas; schema on read; single application; batch oriented; snapshot isolation; eventual consistency guarantees; procedural APIs; encapsulates elements of CAP.
  17. BUILDING BIG DATA BRIDGES ACROSS A HETEROGENEOUS WORLD
  18. COMPREHENSIVE 3-TIER FRAMEWORK: COMMERCIAL AND/OR OPEN SOURCE
     - Eco-system: business intelligence tools, data integration tools, DBA tools, packaged apps
     - Application services: in-database analytics, multi-lingual client APIs, federation, web enabled
     - Data management: high performance, highly scalable, cloud enabled
  19. RELIABLE DATA MANAGEMENT
     A data management layer on a full-mesh, high-speed interconnect:
     - Can handle high-performance, compressed, batch, and ad-hoc analysis
     - Can routinely scale to petabyte-class problems and thousands of concurrent jobs
     - Typical characteristics: massively parallel processing of complex queries; in-memory and on-disk optimizations; elastic resources for user communities; ACID guarantees; data variety; information lifecycle management; user-friendly automation tools; file systems (schema free) and/or DBMS structures (schema specific)
  20. DATA MANAGEMENT INFRASTRUCTURE: ROBUST, SCALABLE, HIGH PERFORMANCE
     Roles across the workflow: data discovery (data scientists), application modeling (business analysts), reports/dashboards (BI programmers), business decisions (business end users), and infrastructure management (DBAs), all on a full-mesh, high-speed interconnect.
     - Dynamic, elastic MPP grid: grow, shrink, and provision on demand; heavy parallelization
     - Load, prepare, mine, and report in a workflow: privacy through isolation of resources; collaboration through sharing of results/data via shared resources
  21. VERSATILE APPLICATION SERVICES
     Programming APIs: Python, ADO.NET, Perl, PHP, Ruby, Java, C++, web services. In-database analytics plug-ins: SQL, PMML, C++, Java, ...
     - Comprehensive declarative and procedural APIs
     - In-database analytics plug-in APIs
     - In-database web services
     - Query and data federation APIs
     - Multi-lingual client APIs
  22. VERSATILE APPLICATION SERVICES: RICH ALGORITHMS CLOSE TO DATA
     Multi-lingual UDF APIs cover scalar to scalar, scalar sets to aggregates, scalar sets to dimensional aggregates, scalar sets to multi-attribute (bulk), and multi-attribute (bulk) to multi-attribute (bulk) functions. Two deployment models (a declaration sketch follows):
     - In-database + in-process: user DLLs are dynamically loaded as shared libraries inside the Sybase IQ process. Highest possible performance; incurs security risks, but manageable via privileges; incurs robustness risks, but manageable via multiplex.
     - In-database + out-of-process: user DLLs run in a separate library access process reached via RPC calls. Lower security risks; lower robustness risks; lower performance than in-process but better than out-of-database.
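     A sketch of what declaring and calling such a function can look like, in the style of Sybase IQ's external C/C++ UDF API; the function name, library name, and table are illustrative, not from the slides:

       -- Hypothetical scalar-to-scalar UDF backed by a C/C++ shared library;
       -- 'my_wt_var@libmyudfs' names the entry point and the library to load.
       CREATE FUNCTION my_wt_var (IN tick_day1 FLOAT, IN tick_day2 FLOAT)
       RETURNS FLOAT
       DETERMINISTIC
       EXTERNAL NAME 'my_wt_var@libmyudfs';

       -- The computation runs close to the data: the library executes inside
       -- (in-process) or beside (out-of-process) the database server.
       SELECT TickSymb, my_wt_var(TickValDay1, TickValDay2) FROM TickDataTab;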
  23. VERSATILE APPLICATION SERVICES: NATIVE MAPREDUCE
     Example: for stocks in the enterprise software sector, find the max relative strength of a stock for a trading day.*
     - Map input (k1, v1): rows of (30-min interval time, ticker symbol, tick value day 1, tick value day 2), e.g. (9:30 am, SAP, 51, 52.4), (9:30 am, ORCL, 31, 28.2), (9:30 am, TDC, 22, 21.3), (10:00 am, SAP, 50.9, 53.1), ...
     - Map output (k2, v2): per (ticker symbol, 30-min interval time), the weighted variance = (a given stock's variance) / (average variance across all N stocks), e.g. SAP at 9:30 am: +1.4 / (SUM(+1.4, -2.8, -0.7, ...) / N stocks)
     - Reduce output (v3): per ticker symbol, the max absolute weighted variance, e.g. SAP: MAX(ABS(9:30 wt var), ABS(10:00 wt var), ...)
     *Calculate the max variance for the day by comparing each 30-min interval's tick values across two days (the trading day and the day before), weighted by the average variance of all stocks for each 30-min interval.
  24. VERSATILE APPLICATION SERVICES: NATIVE MAPREDUCE, THE DECLARATIVE WAY
     For stocks in the enterprise software sector, find the max relative strength of a stock for a trading day.
     - Map function declaration:
       CREATE PROCEDURE MapVarTPF (IN XY TABLE (a1 char, a2 datetime, a3 float, a4 float))
       RESULT SET YZ (b1 char, b2 datetime, b3 float)
     - Reduce function declaration:
       CREATE PROCEDURE RedMaxVarTPF (IN XY TABLE (a1 char, a2 datetime, a3 float))
       RESULT SET YZ (b1 char, b2 float)
     - Query:
       SELECT RedMaxVarTPF.TickSymb, RedMaxVarTPF.MaxVar
       FROM RedMaxVarTPF (TABLE (
                SELECT MapVarTPF.TickSymb, MapVarTPF.30MinIntTime, MapVarTPF.Var
                FROM MapVarTPF (TABLE (
                         SELECT TickDataTab.TickSymb, TickDataTab.30MinIntTime,
                                TickDataTab.30MinValDay1, TickDataTab.30MinValDay2)
                     OVER (PARTITION BY TickDataTab.30MinIntTime)))
            OVER (PARTITION BY MapVarTPF.TickSymb))
       ORDER BY RedMaxVarTPF.TickSymb
     - Native MapReduce parallel execution workflow: MapVarTPF partitioned to 15 parallel instances, RedMaxVarTPF partitioned to 25 parallel instances, and the SQL query collates the output on one node, all over the SAN fabric.
     - Native MapReduce with unstructured data: native MapReduce can just as easily be applied to unstructured data (e.g. text, multi-media) stored in the DBMS, or to unstructured data brought into the DBMS at execution time from external files.
  25. RICH ECO-SYSTEM
     From source to answers: data preparation (event processing), data management (DBMS/filesystem), data federation, and data usage (business intelligence), supported by data modeling / database design tools.
     - Business intelligence tools
     - Data integration tools
     - Data mining tools
     - Application tools
     - DBA tools
  26. RICH ECO-SYSTEM: DBMS <-> HADOOP BRIDGE I
     Client-side federation: join data from the DBMS and Hadoop at the client application level (a sketch follows).
     - Feature characteristics: a client tool capable of querying both the DBMS and Hadoop; performance is better when results from the sources are pre-computed/pre-aggregated.
     - Big Data use cases: ideal for bringing together Big Data Analytics pre-computations from different domains. Example, in telecommunications: the DBMS has aggregated customer loyalty data and Hadoop has aggregated network utilization data; Quest Toad for Cloud can bring data from both sources, linking customer loyalty to network utilization or network faults (e.g. dropped calls).
     (Diagram: Quest Toad for Cloud federating a DBMS and Hadoop/Hive.)
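     A minimal sketch of such a client-side federated join, assuming the federation client exposes each source as a SQL-queryable table; all table and column names here are illustrative, not from the slides:

       -- Hypothetical federated join: loyalty aggregates live in the DBMS,
       -- network-utilization aggregates live in Hadoop/Hive.
       SELECT l.customer_id,
              l.loyalty_score,
              n.dropped_calls,
              n.avg_utilization
       FROM dbms_src.customer_loyalty AS l        -- pre-aggregated in the DBMS
       JOIN hive_src.network_utilization AS n     -- pre-aggregated by Hadoop MR jobs
         ON n.customer_id = l.customer_id
       WHERE n.dropped_calls > 10                 -- link loyalty risk to network faults
       ORDER BY l.loyalty_score DESC;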
  27. RICH ECO-SYSTEM: DBMS <-> HADOOP BRIDGE II
     ETL: load Hadoop data into a DBMS column store, i.e. extract, transform, and load data from HDFS (Hadoop Distributed File System) into the DBMS.
     - Feature characteristics: extract and load subsets of HDFS data into the DBMS store, either raw data from HDFS or the results of Hadoop MR jobs. Once stored, HDFS data is treated like any other DBMS data: it gets the ACID properties of a DBMS; can be indexed, joined, and parallelized; can be queried ad hoc; and is visible to BI and other client tools via the DBMS ANSI SQL API.
     - Big Data use cases: ideal for combining subsets of HDFS unstructured data, or summaries of HDFS data, into the DBMS for mid- to long-term use in business reports. Example, in eCommerce: clickstream data from weblogs stored in HDFS, and the outputs of Hadoop MR jobs on that data (to study browsing behavior), are ETL'd into the DBMS; the transactional sales data in the DBMS is then joined with the clickstream data to understand and predict customers' browsing-to-buying behavior (a sketch follows).
     - Currently, the bulk data transfer utility Sqoop (built by Cloudera) can be used to provide this ETL capability.
     (Diagram: clickstream data flowing from Hadoop/Hive via Sqoop into the DBMS alongside sales data.)
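     Once the clickstream subset lands in the DBMS, the join described above is plain ANSI SQL; a minimal sketch, with illustrative table and column names that are not from the slides:

       -- Hypothetical post-ETL query: 'clickstream' was loaded from HDFS via
       -- Sqoop; 'sales' is native transactional data in the DBMS.
       SELECT c.customer_id,
              COUNT(*) AS pages_viewed,
              SUM(s.order_total) AS revenue
       FROM clickstream c
       LEFT JOIN sales s
              ON s.customer_id = c.customer_id
             AND s.order_date = c.visit_date
       GROUP BY c.customer_id
       ORDER BY pages_viewed DESC;   -- relate browsing intensity to buying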
  28. RICH ECO-SYSTEM: DBMS <-> HADOOP BRIDGE III
     Join HDFS data with DBMS data on the fly: fetch and join subsets of HDFS data on demand using SQL queries from the DBMS (data federation technique; a sketch follows).
     - Feature characteristics: scan and fetch specified data subsets from HDFS via a table UDF called as part of a SQL query; the output is joinable with DBMS data; multiple simultaneous UDF calls are possible; sample UDFs are provided in Java and C++. The HDFS data is not stored in the DBMS: it is fetched into DBMS in-memory tables, so ACID properties are not applicable; for repeated use, put the fetched data into tables. Visible to BI and other client tools via the ANSI SQL API only.
     - Big Data use cases: ideal for combining subsets of HDFS data with DBMS data for operational (transient) business reports. Example, in retail: detailed point-of-sale (POS) data is stored in HDFS; the DBMS EDW fetches POS data for specific hot-selling SKUs from HDFS at fixed intervals and combines it with inventory data in the DBMS to predict and prevent inventory "stockouts".
     (Diagram: POS data in Hadoop/HDFS joined to inventory data in the DBMS over a UDF bridge.)
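     A minimal sketch of the on-the-fly federation, where hdfs_pos_fetch is a hypothetical table UDF (not from the slides) that scans POS detail in HDFS and returns rows as a transient in-memory table:

       -- Flag SKUs at risk of stockout by joining live POS detail from HDFS
       -- against inventory held in the DBMS.
       SELECT p.sku,
              SUM(p.units_sold) AS sold_today,
              i.units_on_hand
       FROM hdfs_pos_fetch('hdfs://namenode/pos/2012-02-16') AS p
       JOIN inventory i ON i.sku = p.sku
       GROUP BY p.sku, i.units_on_hand
       HAVING SUM(p.units_sold) > i.units_on_hand * 0.8;   -- nearing stockout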
  29. RICH ECO-SYSTEM: DBMS <-> HADOOP BRIDGE IV
     Combine the results of Hadoop MR jobs with DBMS data on the fly: initiate and join the results of Hadoop MR jobs on demand using SQL queries from the DBMS (query federation technique; a sketch follows).
     - Feature characteristics: trigger a Hadoop MR job and fetch its results via a table UDF called as part of a Sybase IQ SQL query; the output is joinable with Sybase IQ data; multiple simultaneous UDF calls are not possible; sample UDFs are provided in Java only. The HDFS data is not stored in the DBMS: results are fetched into DBMS in-memory tables, so ACID properties are not applicable; for repeated use, put the fetched data into tables. Visible to BI and other client tools via the DBMS ANSI SQL API only.
     - Big Data use cases: ideal for combining the results of Hadoop MR jobs with DBMS data for operational (transient) business reports. Example, in utilities: smart meter and smart grid data can be combined for load monitoring and demand forecasting. Smart grid transmission-quality data (multi-attribute time-series data) stored in HDFS is computed via Hadoop MR jobs triggered from the DBMS and combined with smart meter consumption data stored in the DBMS to analyze demand and workload.
     (Diagram: smart grid transmission data in Hadoop/HDFS joined to smart meter consumption data in the DBMS over a UDF bridge.)
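     A minimal sketch of the query federation, where run_grid_quality_mr is a hypothetical table UDF (not from the slides) that launches a Hadoop MR job over transmission-quality time series in HDFS and returns its results:

       -- Relate grid transmission quality (computed by a Hadoop MR job) to
       -- demand measured by smart meter data stored in the DBMS.
       SELECT g.region,
              g.quality_index,
              SUM(m.kwh_consumed) AS regional_demand
       FROM run_grid_quality_mr('hdfs://namenode/grid/2012-02') AS g
       JOIN smart_meter_readings m ON m.region = g.region
       GROUP BY g.region, g.quality_index
       ORDER BY regional_demand DESC;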
  30. RICH ECO-SYSTEM: DBMS <-> PREDICTIVE TOOLS BRIDGE
     Express complex computations in the industry-standard Predictive Model Markup Language (PMML), then plug the models in close to the data for execution: a PMML pre-processor converts and validates PMML models into universal prediction plug-in UDFs inside the database server, which SQL applications invoke through the DBMS bridge.
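     A minimal sketch of what in-database scoring through such a plug-in can look like; churn_score and the customers table are hypothetical names, not from the slides:

       -- Hypothetical scoring UDF generated from a PMML model by the
       -- pre-processor; it executes inside the database server, close to data.
       SELECT customer_id,
              churn_score(tenure_months, avg_monthly_spend, support_calls) AS churn_risk
       FROM customers
       WHERE churn_score(tenure_months, avg_monthly_spend, support_calls) > 0.8;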
  31. RICH ECO-SYSTEM: FUNDAMENTALS OF STREAMS TECHNOLOGY
     Process data without storing it (a sketch follows).
     - Input streams: events arrive on input streams.
     - Derived streams and windows: apply continuous query operators to one or more input streams to produce a new stream. Continuous queries (SELECT ... FROM one or more input streams/windows WHERE ... GROUP BY ...) create a new "derived" stream or window.
     - Windows can have state: retention rules define how many events are kept, or for how long; opcodes in events can indicate insert/update/delete and can be applied to the window automatically.
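     A sketch of a continuous query in CCL-style syntax, assuming Sybase ESP's dialect; the stream, window, and field names are illustrative. An input stream of trades feeds a derived window holding a 10-minute average price per symbol:

       // Hypothetical CCL: events arrive on the input stream...
       CREATE INPUT STREAM Trades
           SCHEMA (Symbol string, Price float, Shares integer);

       // ...and a continuous query derives a stateful window from it.
       // The KEEP clause is the retention rule: only the last 10 minutes
       // of events contribute to each symbol's average.
       CREATE OUTPUT WINDOW AvgPrice
           PRIMARY KEY DEDUCED
       AS
           SELECT Trades.Symbol, avg(Trades.Price) AS AvgPx
           FROM Trades KEEP 10 MINUTES
           GROUP BY Trades.Symbol;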
  32. RICH ECO-SYSTEM: STREAMS DATA PROCESSING VS. TRADITIONAL DATA PROCESSING
     SQL vs. CCL: tables vs. windows on event streams; rows vs. events; columns vs. fields; on-demand (the query runs when information is needed) vs. event-driven (the query updates when information arrives).
  33. RICH ECO-SYSTEM: STREAMS PRE-PROCESSING
     Why store Big Data when you can deal with Small Data? Pre-filter unnecessary data on the fly with streams technologies: the ESP engine raises alerts, actions, and updates from memory, and only the filtered remainder goes to disk (Hadoop/HDFS, DBMS).
  34. SUMMARY
  35. 3-LAYER LOGICAL INTEGRATION: STREAM PROCESSING <-> NoSQL <-> DBMS
     - Eco-system: BI tools, DI tools, DBA tools, data mining tools
     - App services: Web 2.0, Java, C/C++, SQL, federation
     - Data layer: unstructured data, ingested and persisted (Hadoop, content management); structured data (DBMS); streaming data (ESP)
     The heterogeneous world will require co-existence and playing nice!
  36. Q&A
     Learn more: http://www.sybase.com/sybaseiqbigdata
     Contact: 1-800-SYBASE5 (792.2735)
