Having covered the Source and Channel components, we now turn to the Sink component.
Flume ships with many built-in Sinks, such as the HDFS Sink, Hive Sink, File Roll Sink, HBase Sink, and ElasticSearch Sink.
The Sink interface:
public interface Sink extends LifecycleAware, NamedComponent {
public void setChannel(Channel channel);
public Channel getChannel();
public Status process() throws EventDeliveryException;
public static enum Status {
READY, BACKOFF
}
}
There is not much to explain here; it mirrors the interfaces we have already seen. The Status value returned by process() tells the SinkRunner how to proceed: READY means events were delivered and process() can be called again right away, while BACKOFF asks the runner to wait before retrying.
The Sink implementation class AbstractSink:
abstract public class AbstractSink implements Sink, LifecycleAware
Similar to AbstractSource.
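To make the pattern concrete, here is a minimal sketch of a custom sink built on AbstractSink (the ConsoleSink class is made up for illustration and is not part of the Flume codebase). It follows the same take/commit/rollback transaction protocol that HDFSEventSink's process() method uses below:
import java.nio.charset.StandardCharsets;
import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.sink.AbstractSink;

public class ConsoleSink extends AbstractSink {
  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event == null) {
        // Channel is empty: commit the empty transaction and ask to back off
        txn.commit();
        return Status.BACKOFF;
      }
      // "Deliver" the event; a real sink would write to external storage
      System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
      txn.commit();
      return Status.READY;
    } catch (Throwable t) {
      // Roll back so any taken event is returned to the channel
      txn.rollback();
      throw new EventDeliveryException(t);
    } finally {
      txn.close();
    }
  }
}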
Examples of Flume's built-in Sinks
HDFSEventSink
A simple HDFS sink configuration:
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
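The %y-%m-%d/%H%M/%S placeholders in hdfs.path are expanded from the event's timestamp header. Here is a rough, hedged illustration (the timestamp value is hypothetical, and it assumes BucketPath's simple two-argument escapeString overload with the JVM time zone set to UTC):
import java.util.HashMap;
import java.util.Map;
import org.apache.flume.formatter.output.BucketPath;

public class PathEscapeDemo {
  public static void main(String[] args) {
    Map<String, String> headers = new HashMap<String, String>();
    // BucketPath reads the "timestamp" header in epoch milliseconds
    headers.put("timestamp", "1465827825000"); // 2016-06-13 14:23:45 UTC
    System.out.println(
        BucketPath.escapeString("/flume/events/%y-%m-%d/%H%M/%S", headers));
    // Prints /flume/events/16-06-13/1423/45. With round = true,
    // roundValue = 10 and roundUnit = minute, the sink instead rounds the
    // timestamp down and writes under /flume/events/16-06-13/1420/00.
  }
}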
The process() method of HDFSEventSink (declared by the Sink interface) shows how events are written to HDFS files:
public Status process() throws EventDeliveryException {
// Get the Channel
Channel channel = getChannel();
// Get the Transaction from the Channel; the commits and rollbacks of events below are all based on this Transaction
Transaction transaction = channel.getTransaction();
List<BucketWriter> writers = Lists.newArrayList();
// Start the transaction. Ordinary Transaction implementations do not override this method unless they have special requirements; the default begin() does nothing
transaction.begin();
try {
int txnEventCount = 0;
// Each call processes up to batchSize events; batchSize is configurable
for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
// channel.take() internally uses the Transaction's take; if the Transaction is rolled back, all events taken here are returned to the Channel
Event event = channel.take();
if (event == null) {
break;
}
// reconstruct the path name by substituting place holders
String realPath = BucketPath.escapeString(filePath, event.getHeaders(),
timeZone, needRounding, roundUnit, roundValue, useLocalTime);
String realName = BucketPath.escapeString(fileName, event.getHeaders(),
timeZone, needRounding, roundUnit, roundValue, useLocalTime);
String lookupPath = realPath + DIRECTORY_DELIMITER + realName;
BucketWriter bucketWriter;
HDFSWriter hdfsWriter = null;
// Callback to remove the reference to the bucket writer from the
// sfWriters map so that all buffers used by the HDFS file
// handles are garbage collected.
WriterCallback closeCallback = new WriterCallback() {
@Override
public void run(String bucketPath) {
LOG.info("Writer callback called.");
synchronized (sfWritersLock) {
sfWriters.remove(bucketPath);
}
}
};
synchronized (sfWritersLock) {
bucketWriter = sfWriters.get(lookupPath);
// we haven't seen this file yet, so open it and cache the handle
if (bucketWriter == null) {
hdfsWriter = writerFactory.getWriter(fileType);
bucketWriter = initializeBucketWriter(realPath, realName,
lookupPath, hdfsWriter, closeCallback);
sfWriters.put(lookupPath, bucketWriter);
}
}
// track the buckets getting written in this transaction
if (!writers.contains(bucketWriter)) {
writers.add(bucketWriter);
}
// Write the data to HDFS
try {
bucketWriter.append(event);
} catch (BucketClosedException ex) {
LOG.info("Bucket was closed while trying to append, " +
"reinitializing bucket and writing event.");
hdfsWriter = writerFactory.getWriter(fileType);
bucketWriter = initializeBucketWriter(realPath, realName,
lookupPath, hdfsWriter, closeCallback);
synchronized (sfWritersLock) {
sfWriters.put(lookupPath, bucketWriter);
}
bucketWriter.append(event);
}
}
if (txnEventCount == 0) {
sinkCounter.incrementBatchEmptyCount();
} else if (txnEventCount == batchSize) {
sinkCounter.incrementBatchCompleteCount();
} else {
sinkCounter.incrementBatchUnderflowCount();
}
// flush all pending buckets before committing the transaction
for (BucketWriter bucketWriter : writers) {
bucketWriter.flush();
}
// Once the data has been written without errors, commit and clear the Transaction's data
transaction.commit();
if (txnEventCount < 1) {
return Status.BACKOFF;
} else {
sinkCounter.addToEventDrainSuccessCount(txnEventCount);
return Status.READY;
}
} catch (IOException eIO) {
// Roll back on exception
transaction.rollback();
LOG.warn("HDFS IO error", eIO);
return Status.BACKOFF;
} catch (Throwable th) {
// Roll back on exception
transaction.rollback();
LOG.error("process failed", th);
if (th instanceof Error) {
throw (Error) th;
} else {
throw new EventDeliveryException(th);
}
} finally {
// End the transaction. Ordinary Transaction implementations do not override this method unless they have special requirements; the default close() does nothing
transaction.close();
}
}
Next, let's focus on how the HDFS Sink writes data to HDFS files.
Before the analysis, note the settings that control whether a new HDFS file is created. When any of these conditions is met, the sink closes the current file, creates a new one, and carries on (a sample combination follows this list):
hdfs.rollInterval -> close the file after this many seconds. Default: 30; 0 disables time-based rolling
hdfs.rollSize -> close the file once it reaches this size, in bytes. Default: 1024; 0 disables size-based rolling
hdfs.batchSize -> the number of events the sink handles per batch; reaching it triggers a flush, not a new file. Default: 100
hdfs.rollCount -> close the file after this many events have been written. Default: 10; 0 disables count-based rolling
hdfs.minBlockReplicas -> the required number of block replicas, corresponding to Hadoop's dfs.replication property; defaults to the value from the Hadoop configuration
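For example, a hypothetical combination (the values are illustrative, not recommendations) that rolls on a 10-minute timer or at roughly 128 MB, with count-based rolling disabled:
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 100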
For each batch of up to batchSize events taken from the channel, the HDFS Sink computes the target location of every Event and constructs a BucketWriter for it, cached in the sfWriters map with the file path as key and the BucketWriter as value.
It then appends each event through the BucketWriter's append method, and once the batch has been processed it calls flush on every BucketWriter touched by the transaction:
bucketWriter.append(event);
for (BucketWriter bucketWriter : writers) {
bucketWriter.flush();
}
Part of BucketWriter's append method is shown below:
// Open the file if it is not already open
if (!isOpen) {
if (closed) {
throw new BucketClosedException("This bucket writer was closed and " +
"this handle is thus no longer valid");
}
open();
}
// Check whether a new file needs to be created
if (shouldRotate()) {
boolean doRotate = true;
if (isUnderReplicated) {
if (maxConsecUnderReplRotations > 0 &&
consecutiveUnderReplRotateCount >= maxConsecUnderReplRotations) {
doRotate = false;
if (consecutiveUnderReplRotateCount == maxConsecUnderReplRotations) {
LOG.error("Hit max consecutive under-replication rotations ({}); " +
"will not continue rolling files under this path due to " +
"under-replication", maxConsecUnderReplRotations);
}
} else {
LOG.warn("Block Under-replication detected. Rotating file.");
}
consecutiveUnderReplRotateCount++;
} else {
consecutiveUnderReplRotateCount = 0;
}
if (doRotate) {
close();
open();
}
}
try {
sinkCounter.incrementEventDrainAttemptCount();
// Write to the HDFS file
callWithTimeout(new CallRunner<Void>() {
@Override
public Void call() throws Exception {
writer.append(event); // could block
return null;
}
});
} catch (IOException e) {
LOG.warn("Caught IOException writing to HDFSWriter ({}). Closing file (" +
bucketPath + ") and rethrowing exception.",
e.getMessage());
try {
close(true);
} catch (IOException e2) {
LOG.warn("Caught IOException while closing file (" +
bucketPath + "). Exception follows.", e2);
}
throw e;
}
// Update bookkeeping counters
processSize += event.getBody().length;
eventCounter++;
batchCounter++;
// Flush once the number of events written reaches the batch size
if (batchCounter == batchSize) {
flush();
}
The shouldRotate method determines whether a new file needs to be created:
private boolean shouldRotate() {
boolean doRotate = false;
// First check whether HDFS is still replicating the block; this has the highest priority and corresponds to minBlockReplicas in the configuration
if (writer.isUnderReplicated()) {
this.isUnderReplicated = true;
doRotate = true;
} else {
this.isUnderReplicated = false;
}
// rollCount from the configuration, i.e. the event count
if ((rollCount > 0) && (rollCount <= eventCounter)) {
LOG.debug("rolling: rollCount: {}, events: {}", rollCount, eventCounter);
doRotate = true;
}
// rollSize from the configuration, i.e. the file size
if ((rollSize > 0) && (rollSize <= processSize)) {
LOG.debug("rolling: rollSize: {}, bytes: {}", rollSize, processSize);
doRotate = true;
}
return doRotate;
}
Some readers configure the other roll properties but leave minBlockReplicas alone, and then find that HDFS still produces many files. This is because minBlockReplicas has the highest priority: whenever a block is currently being replicated, all the other conditions are ignored and the file is rotated anyway.
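A common mitigation (assuming your durability requirements allow it) is to pin the required replica count to 1 so that temporary under-replication no longer forces rotations:
a1.sinks.k1.hdfs.minBlockReplicas = 1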
The flush method:
public synchronized void flush() throws IOException, InterruptedException {
checkAndThrowInterruptedException();
if (!isBatchComplete()) {
doFlush();
if(idleTimeout > 0) {
// if the future exists and couldn't be cancelled, that would mean it has already run
// or been cancelled
if(idleFuture == null || idleFuture.cancel(false)) {
Callable<Void> idleAction = new Callable<Void>() {
public Void call() throws Exception {
LOG.info("Closing idle bucketWriter {} at {}", bucketPath,
System.currentTimeMillis());
if (isOpen) {
close(true);
}
return null;
}
};
idleFuture = timedRollerPool.schedule(idleAction, idleTimeout,
TimeUnit.SECONDS);
}
}
}
}
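The idleTimeout field above corresponds to the hdfs.idleTimeout property, which defaults to 0 (disabled). When set to a positive value, a bucket writer that receives no events for that many seconds is closed by the timedRollerPool, for example:
a1.sinks.k1.hdfs.idleTimeout = 60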
Summary:
The hdfs.rollSize, hdfs.rollCount, hdfs.minBlockReplicas, hdfs.rollInterval, and hdfs.batchSize settings of the HDFS Sink are all consumed inside BucketWriter.
The four properties hdfs.rollSize, hdfs.rollCount, hdfs.minBlockReplicas, and hdfs.rollInterval decide whether a new file is created.
hdfs.batchSize decides when file data is flushed. Since the HDFS Sink hands at most batchSize events to a BucketWriter per transaction, the BucketWriter can simply flush once batchSize events have been appended.