Big Data Processing Using Hadoop Infrastructure
Published in: Software
  • 1. Big Data processing using Hadoop infrastructure
  • 2. Use case: Intrum Justitia SDC
      • 20 countries, with different applications to process, store and analyze data
      • Non-unified data storage formats
      • High number of data objects (records, transactions, entities)
      • Data may have complex strong or loose relation rules
      • Often involves time-stamped events made of incomplete data
  • 3. Possible solutions
      • Solution custom-built from the ground up
      • Semi-clustered approach
          ‒ Tools from Oracle
          ‒ MySQL/PostgreSQL nodes
          ‒ Document-oriented tools like MongoDB
      • “Big Data” approach (Map-Reduce)
  • 4. Map-Reduce
      • Simple programming model that applies to many large-scale computing problems
      • Available in MongoDB for sharded data
      • MapReduce tools usually offer:
          ‒ automatic parallelization
          ‒ load balancing
          ‒ network/disk transfer optimization
          ‒ handling of machine failures
          ‒ robustness
      • Introduced by Google; open-source implementation by Apache (Hadoop); enterprise support by Cloudera
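The programming model on this slide can be illustrated with a minimal single-process sketch in plain Java. It is not Hadoop code: the class `MiniMapReduce` and its methods are hypothetical names chosen for the example, and the parallelization, load balancing and failure handling listed above are exactly what it leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, single-process illustration of map -> shuffle -> reduce (word count).
// All names are hypothetical; a real framework runs these phases on many machines.
final class MiniMapReduce {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values of one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in sequence over the given input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(emitted).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, cluster=1, data=1}
        System.out.println(wordCount(List.of("big data", "big cluster")));
    }
}
```

Because `map` only looks at one line and `reduce` only looks at one key, both can in principle run on different machines, which is what lets a framework parallelize them automatically.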
  • 5. CDH ecosystem
      • Cloudera Manager
      • YARN (resource manager)
      • HDFS
      • MapReduce
      • HBase (NoSQL)
      • Hive (HQL)
      • Sqoop (import/export)
      • Parquet
      • Impala (SQL / ODBC)
      • Pig (Pig Latin)
      • ZooKeeper (coordination)
      • Hue
  • 6. HDFS
      • Hadoop Distributed File System
      • Redundancy
      • Fault tolerant
      • Scalable
      • Self-healing
      • Write once, read many times
      • Java API
      • Command-line tool
      • Mountable (FUSE)
  • 7. HDFS file read
      • Code to data, not data to code
      • [Diagram, steps 1 to 4: Client application → HDFS client → NameNode lookup of /bob/file.txt (Block A: DataNode 2, DataNode 3; Block B: DataNode 1, DataNode 3) → block reads from the DataNodes. DataNode 1 holds blocks C, B, D; DataNode 2 holds C, A, D; DataNode 3 holds C, B, A.]
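The read path in the diagram can be mimicked with a toy lookup in plain Java. This is a sketch under assumptions, not the HDFS client API: `ToyHdfsRead` and `planRead` are hypothetical names, the block placement is copied from the slide, and picking the first replica is a simplification (real HDFS prefers a replica close to the reader).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of an HDFS read: ask the "NameNode" tables where the blocks live,
// then decide which DataNode to read each block from. Hypothetical names only.
final class ToyHdfsRead {

    // NameNode metadata: file path -> ordered list of block ids.
    static final Map<String, List<String>> FILE_BLOCKS =
            Map.of("/bob/file.txt", List.of("A", "B"));

    // NameNode metadata: block id -> DataNodes holding a replica (as on the slide).
    static final Map<String, List<String>> BLOCK_LOCATIONS = Map.of(
            "A", List.of("DataNode 2", "DataNode 3"),
            "B", List.of("DataNode 1", "DataNode 3"));

    // Returns, for each block of the file in order, the DataNode to read it from.
    static Map<String, String> planRead(String path) {
        List<String> blocks = FILE_BLOCKS.get(path);
        if (blocks == null) {
            throw new IllegalArgumentException("No such file: " + path);
        }
        Map<String, String> plan = new LinkedHashMap<>();
        for (String block : blocks) {
            // Simplification: always take the first listed replica.
            plan.put(block, BLOCK_LOCATIONS.get(block).get(0));
        }
        return plan;
    }

    public static void main(String[] args) {
        // prints {A=DataNode 2, B=DataNode 1}
        System.out.println(planRead("/bob/file.txt"));
    }
}
```

The point of the split is that the NameNode only serves metadata; the block contents travel directly between the client and the DataNodes.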
  • 8. Map-Reduce workflow and redundancy
      • [Diagram: (1) the user program forks a master and workers; (2) the master assigns map and reduce tasks to workers; (3) map workers read the input splits (Split 0 to Split 4); (4) map workers write intermediate files locally; (6) reduce workers write Output File 0 and Output File 1]
      • Stages: input files → MAP phase → intermediate files → REDUCE phase → output files
  • 9. Hive and Pig
      • Hive
          ‒ High-level data access language
          ‒ Data warehouse system for Hadoop
          ‒ Data aggregation
          ‒ Ad-hoc queries
          ‒ SQL-like language (HiveQL)
      • Pig
          ‒ High-level data access language (zero Java knowledge required)
          ‒ Data preparation (ETL)
          ‒ Pig Latin scripting language
      • [Diagram: SQL → Hive → MapReduce; Pig Latin → Pig → MapReduce]
  • 10. HiveQL vs. Pig Latin
      Pig Latin is procedural, whereas HQL is declarative.

      HiveQL:

          insert into ValClickPerDMA
          select dma, count(*)
          from geoinfo join (
              select name, ipaddr
              from users join clicks on (users.name = clicks.user)
              where value > 0
          ) using ipaddr
          group by dma;

      Pig Latin:

          Users = load 'users' as (name, age, ipaddr);
          Clicks = load 'clicks' as (user, url, value);
          ValuableClicks = filter Clicks by value > 0;
          UserClicks = join Users by name, ValuableClicks by user;
          Geoinfo = load 'geoinfo' as (ipaddr, dma);
          UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
          ByDMA = group UserGeo by dma;
          ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
          store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
  • 11. Impala
      • Real-time queries, ~100x faster compared to Hive
      • Direct data access
      • Query data on HDFS or HBase
      • Allows table joins and aggregation
      • [Diagram: ODBC → Impala → HDFS / HBase]
  • 12. Parquet
      • Row group: a group of rows in columnar format
          ‒ One (or more) per split while reading
          ‒ Max size buffered in memory while writing
          ‒ About 50 MB < row group < 1 GB
      • Column chunk: data for one column in a row group
          ‒ Column chunks can be read independently for efficient scans
      • Page: unit of access in a column chunk
          ‒ Should be big enough for efficient compression
          ‒ Min size to read while accessing a single record
          ‒ About 8 KB < page < 1 MB
      Source: Lars George, Cloudera. Data I/O, 2013
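The row group and column chunk idea can be sketched in plain Java. This is an illustration only, not the Parquet library or its file format: `ToyColumnarLayout`, `row`, `toRowGroups` and `scanColumn` are hypothetical names, row groups are sized by row count rather than bytes, and pages, encoding and compression are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy columnar layout: rows are cut into row groups, and inside each group
// the values of one column are stored together as a "column chunk".
final class ToyColumnarLayout {

    // Helper that builds one sample row (hypothetical schema: id, country).
    static Map<String, Object> row(int id, String country) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", id);
        row.put("country", country);
        return row;
    }

    // Each row group maps a column name to its column chunk.
    static List<Map<String, List<Object>>> toRowGroups(
            List<Map<String, Object>> rows, List<String> columns, int rowsPerGroup) {
        if (rowsPerGroup <= 0) {
            throw new IllegalArgumentException("rowsPerGroup must be positive");
        }
        List<Map<String, List<Object>>> groups = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += rowsPerGroup) {
            int end = Math.min(start + rowsPerGroup, rows.size());
            Map<String, List<Object>> group = new LinkedHashMap<>();
            for (String column : columns) {
                List<Object> chunk = new ArrayList<>();
                for (Map<String, Object> row : rows.subList(start, end)) {
                    chunk.add(row.get(column));
                }
                group.put(column, chunk);
            }
            groups.add(group);
        }
        return groups;
    }

    // Scans a single column: only that column's chunks are touched.
    static List<Object> scanColumn(List<Map<String, List<Object>>> groups, String column) {
        List<Object> values = new ArrayList<>();
        for (Map<String, List<Object>> group : groups) {
            values.addAll(group.get(column));
        }
        return values;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(row(1, "SE"), row(2, "FI"), row(3, "SE"));
        List<Map<String, List<Object>>> groups = toRowGroups(rows, List.of("id", "country"), 2);
        // prints [SE, FI, SE]
        System.out.println(scanColumn(groups, "country"));
    }
}
```

A query that only needs `country` never reads the `id` chunks, which is the reason column chunks make scans efficient.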
  • 13. HBase
      Column-oriented data storage. Very large tables: billions of rows × millions of columns.
      • Low latency
      • Random reads and writes (by PK)
      • Distributed key/value store; automatic region sharding
      • Simple API
          ‒ PUT
          ‒ GET
          ‒ DELETE
          ‒ SCAN
  • 14. HBase building blocks
      • The most basic unit in HBase is a column
          ‒ Each column may have multiple versions, with each distinct value contained in a separate cell
          ‒ One or more columns form a row, which is addressed uniquely by a row key
          ‒ Can have millions of columns
          ‒ Can be compressed or tagged to stay in memory
      • A table is a collection of rows
          ‒ All rows are always sorted lexicographically by their row key
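The building blocks above (row key, column, versioned cells, sorted rows) can be modelled with nested sorted maps in plain Java. This is a sketch, not the HBase client API: `ToyHBaseTable` and its methods are hypothetical, and Java string ordering stands in for HBase's byte-wise ordering, which agrees for ASCII keys.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy table: row key -> column -> timestamp (newest first) -> value.
final class ToyHBaseTable {

    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // PUT: store one version of a cell.
    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // GET: newest version of a cell, or null if the cell does not exist.
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) {
            return null;
        }
        return row.get(column).firstEntry().getValue();
    }

    // DELETE: remove a whole row.
    void delete(String rowKey) {
        rows.remove(rowKey);
    }

    // SCAN: row keys in [startRow, stopRow), in lexicographic order.
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        ToyHBaseTable table = new ToyHBaseTable();
        table.put("row2", "status", 1L, "OPEN");
        table.put("row10", "status", 1L, "OPEN");
        table.put("row1", "status", 1L, "OPEN");
        table.put("row1", "status", 2L, "CLOSED");
        // prints CLOSED (the newest version wins)
        System.out.println(table.get("row1", "status"));
        // prints [row1, row10, row2] (lexicographic, not numeric, order)
        System.out.println(table.scan("row1", "row3"));
    }
}
```

The scan output shows why row key design matters: `row10` sorts before `row2`, so numeric keys are usually zero-padded when range scans need numeric order.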
  • 15. HBase read
      • [Diagram: Client, ZooKeeper, HMaster and three RegionServers]
  • 16. Hadoop Oozie
      • Oozie is a workflow scheduler system to manage Hadoop jobs
      • A workflow is a collection of actions
      • Actions are arranged in a directed acyclic graph (DAG)
      • [Diagram: a job is submitted to the Oozie server, which schedules MapReduce, Pig and other actions, receives each result, re-schedules on failure, and finally reports "All done"]
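How a scheduler can derive a run order from a directed acyclic graph of actions is sketched below in plain Java using Kahn's topological sort. This is not how Oozie is configured or implemented (Oozie workflows are defined in XML): `ToyWorkflow` and the action names are hypothetical and only illustrate the DAG idea.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy workflow: actions plus "must finish first" dependencies, ordered with
// Kahn's algorithm. A cycle means the graph is not a valid workflow.
final class ToyWorkflow {

    // action name -> names of the actions it depends on
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    ToyWorkflow action(String name, String... dependsOn) {
        dependencies.put(name, List.of(dependsOn));
        return this;
    }

    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>();   // unmet dependency counts
        Map<String, List<String>> dependents = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : dependencies.entrySet()) {
            remaining.put(entry.getKey(), entry.getValue().size());
            for (String dependency : entry.getValue()) {
                if (!dependencies.containsKey(dependency)) {
                    throw new IllegalStateException("Unknown action: " + dependency);
                }
                dependents.computeIfAbsent(dependency, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String action = ready.remove();
            order.add(action);
            // Finishing an action may unblock the actions that depend on it.
            for (String next : dependents.getOrDefault(action, List.of())) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != dependencies.size()) {
            throw new IllegalStateException("Workflow contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        ToyWorkflow workflow = new ToyWorkflow()
                .action("parse")
                .action("validate", "parse")
                .action("aggregate", "validate")
                .action("export", "validate")
                .action("notify", "aggregate", "export");
        // prints [parse, validate, aggregate, export, notify]
        System.out.println(workflow.executionOrder());
    }
}
```

In this example `aggregate` and `export` depend only on `validate`, so a real scheduler could run them in parallel; the sketch simply lists them one after the other.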
  • 17. Hadoop v1 architecture
      • JobTracker
          ‒ Manages cluster resources
          ‒ Job scheduling
      • TaskTracker
          ‒ Per-node agent
          ‒ Task management
      • Single purpose: batch processing
  • 18. Hadoop v2 - YARN architecture
      • ResourceManager – Allocates cluster resources
      • NodeManager – Enforces node resource allocations
      • ApplicationMaster – Application lifecycle and task scheduler
      • Multi-purpose system: batch, interactive querying, streaming, aggregation
      • [Diagram: Client, ResourceManager and three NodeManagers hosting a Container and the App Mstr; arrows labelled Job Submission, MapReduce Status, Node Status and Resource Request]
  • 19. Hadoop infrastructure integration
      • [Diagram]
      • Data In: TxB, IW extract program
      • Transfer: (S)FTP, SCP, HTTP(S), JDBC
      • Hadoop Cluster storage: HDFS RAW, HDFS Binary, HDFS Results
      • Processing: Parsing & Validation, Conversion & Compression, Data Quality Analysis, Business Analytics, Data Transformation, Data Delivery
      • Data Out: Intrum Web, PAM, GSS, Catalyst, Dashboard
      • Monitoring and Management
  • 20. Development environment integration
      Generic enterprise stack
      • Maven
      • Spring, Spring-Hadoop
      • Hibernate
      • H2, MySQL cluster
      • LDAP, Kerberos
      • CI (Jenkins ...)
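  • 21. Java example: Hadoop task

          import javax.persistence.EntityManagerFactory;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.conf.Configured;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.ResultScanner;
          import org.apache.hadoop.hbase.client.Scan;
          import org.apache.hadoop.hbase.util.Bytes;
          import org.apache.hadoop.util.Tool;
          import org.springframework.beans.factory.annotation.Autowired;
          import org.springframework.stereotype.Component;

          @Component
          public class HBaseEventLogToMySQL extends Configured implements Tool {

              @Autowired
              private EntityManagerFactory entityManagerFactory;

              // tableName, getLastMySQLEvent() and readRowsToMySQL() are not shown on the slide.

              @Override
              public int run(String[] args) throws Exception {
                  // Resume from the last event already copied to MySQL, if there is one.
                  LogAbstractEvent lastEvent = getLastMySQLEvent();
                  Scan scan;
                  String lastEventKey = "";
                  if (lastEvent == null) {
                      scan = new Scan();
                  } else {
                      lastEventKey = lastEvent.getEventKey();
                      // Start the scan just after the last copied row key.
                      scan = new Scan(Bytes.toBytes(lastEventKey + Character.MAX_VALUE));
                  }
                  final Configuration conf = HBaseConfiguration.create(getConf());
                  // Close the table and the scanner even if copying fails.
                  try (HTable table = new HTable(conf, tableName);
                       ResultScanner resultScanner = table.getScanner(scan)) {
                      readRowsToMySQL(resultScanner);
                  }
                  return 0;
              }
          }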
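  • 22. Java example: Map part (Table)

          import java.io.IOException;
          import java.util.Map;
          import org.apache.hadoop.hbase.client.Result;
          import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
          import org.apache.hadoop.hbase.mapreduce.TableMapper;
          import org.apache.hadoop.io.DoubleWritable;
          import org.apache.hadoop.io.Text;

          public class BasicProdStatusHbaseMapper extends TableMapper<Text, MapWritableComparable> {

              @Override
              public void map(ImmutableBytesWritable key, Result value, Context context)
                      throws IOException, InterruptedException {
                  // Read the serialized case column and convert it to a map.
                  byte[] caseByteArr = value.getValue(FAMILY, CASE_QUALIFIER);
                  // The slide shows a raw Map<>; the type arguments here are an assumption.
                  Map<String, Object> caseMap = HbaseUtils.convertCaseColumnByteArrToMap(caseByteArr);

                  MapWritableComparable map = new MapWritableComparable();
                  map.put(new Text("originalCapital"),
                          new DoubleWritable((Double) caseMap.get("OriginalCapital")));
                  map.put(new Text("remainingCapital"),
                          new DoubleWritable((Double) caseMap.get("RemainingCapital")));

                  context.getCounter(COUNTERS_COMMON_GROUP, COUNTER_CASES_MAPPED).increment(1);
                  // mainStatusCode is not defined on the slide.
                  context.write(new Text(mainStatusCode), map);
              }
          }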
  • 23. Java example: Reduce part

          import java.io.IOException;
          import org.apache.hadoop.io.DoubleWritable;
          import org.apache.hadoop.io.NullWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Reducer;

          public class BasicProdStatusHbaseReducer
                  extends Reducer<Text, MapWritableComparable, BasicProdStatusWritable, NullWritable> {

              @Override
              protected void reduce(Text key, Iterable<MapWritableComparable> values, Context context)
                      throws IOException, InterruptedException {
                  String mainStatusCode = key.toString();
                  AggregationBean ab = new AggregationBean();
                  // Accumulate the capital figures of every case with this status code.
                  for (MapWritableComparable map : values) {
                      double originalCapital =
                              ((DoubleWritable) map.get(new Text("originalCapital"))).get();
                      double remainingCapital =
                              ((DoubleWritable) map.get(new Text("remainingCapital"))).get();
                      ab.add(originalCapital, remainingCapital);
                  }
                  // pid is not defined on the slide.
                  context.write(ab.getDBObject(mainStatusCode, pid), NullWritable.get());
                  context.getCounter(COMMON_GROUP, CASES_PROCESSED).increment(1);
              }
          }
  • 24. Q&A
      We are hiring!