Finding and Removing Emoji from a String in Java
Author: LetterZ
Emoji are special characters in the Unicode character set. This article shows how to find and remove emoji from a string in Java.
1. Background
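- Emoji are special characters in the Unicode character set (UTF-8 is only one of its encodings). Most basic emoji are assigned to two ranges in Plane 1 of the Unicode code space, U+1F300–U+1F6FF and U+1F900–U+1FAFF. Because these code points lie outside the Basic Multilingual Plane, each one occupies two char values (a surrogate pair) in a Java string.
- Skin tone modifiers: most people-related emoji are yellow by default, so five code points were later introduced as modifiers: U+1F3FB, U+1F3FC, U+1F3FD, U+1F3FE, and U+1F3FF. Appending a modifier to an existing emoji produces a new variant:
  U+1F44B (👋) + U+1F3FD = 👋🏽
- Variation selectors and combining marks: an ordinary character followed by one or more variation selectors or combining marks forms an emoji:
  U+25C0 + U+FE0F = ◀️
  U+27A1 + U+FE0F = ➡️
  1 + U+FE0F + U+20E3 = 1️⃣
- Flags: each flag combines two regional indicator symbols, whose code points span U+1F1E6–U+1F1FF. In storage terms a flag is the same as two ordinary emoji characters from that range:
  U+1F1E8 + U+1F1F3 = 🇨🇳
- Zero-width joiner (ZWJ): several base emoji joined by the zero-width joiner (U+200D) form a compound emoji:
  👩 + U+200D + 🌾 = 👩‍🌾
  👩 + U+200D + ❤️ + U+200D + 👩 = 👩‍❤️‍👩
  👨 + U+200D + ❤️ + U+200D + 💋 + U+200D + 👨 = 👨‍❤️‍💋‍👨
- Tag sequences: a base emoji followed by several tag characters (U+E0020–U+E007E) and terminated by Cancel Tag (U+E007F) forms a compound emoji:
  U+1F3F4 (🏴) + U+E0067 + U+E0062 + U+E0065 + U+E006E + U+E0067 + U+E007F = the flag of England (shown as a plain 🏴 where the sequence is not supported)
- Special symbols: a special symbol is a single char, and some of them are treated as emoji in certain environments: ⏯, ⏫, ⏹.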
Unicode only defines the mapping from code points to emoji. It does not prescribe the glyphs, so each emoji font is free to draw them in its own way.
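The forms listed above differ in how many code points they contain and how many char values they occupy in a Java string. The following is a small sketch for checking this; the class and method names (EmojiCodeUnitsDemo, fromCodePoints, sizes) are made up for illustration and are not part of the article's code.

```java
class EmojiCodeUnitsDemo {

    // Hypothetical helper: builds a String from a list of code points.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Hypothetical helper: returns "<char count>/<code point count>".
    static String sizes(String s) {
        return s.length() + "/" + s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Basic emoji: one code point stored as a surrogate pair.
        System.out.println(sizes(fromCodePoints(0x1F44B)));                 // 2/1
        // Basic emoji followed by a skin tone modifier.
        System.out.println(sizes(fromCodePoints(0x1F44B, 0x1F3FD)));        // 4/2
        // Keycap: digit, variation selector, combining enclosing keycap.
        System.out.println(sizes(fromCodePoints('1', 0xFE0F, 0x20E3)));     // 3/3
        // Flag: two regional indicator symbols.
        System.out.println(sizes(fromCodePoints(0x1F1E8, 0x1F1F3)));        // 4/2
        // ZWJ sequence: woman, zero-width joiner, sheaf of rice.
        System.out.println(sizes(fromCodePoints(0x1F469, 0x200D, 0x1F33E))); // 5/3
        // Tag sequence: black flag, five tag characters, Cancel Tag.
        System.out.println(sizes(fromCodePoints(
                0x1F3F4, 0xE0067, 0xE0062, 0xE0065, 0xE006E, 0xE0067, 0xE007F))); // 14/7
        // Special symbol in the Basic Multilingual Plane: a single char.
        System.out.println(sizes(fromCodePoints(0x23EF)));                  // 1/1
    }
}
```

Apart from single-char symbols, every form takes at least two char values, which is the property the detection logic relies on.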
2. Solution
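Apart from a few emoji that are single special symbols, every emoji occupies at least two char values. The code therefore first inspects the type of the second char to decide whether the current position starts an emoji, using Character.UnicodeBlock.of and Character.getType to determine the type of each char.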
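As a rough illustration of these two calls, the sketch below checks the block of the second char and the general category of a single char. The class and method names are hypothetical, and the block check is deliberately narrower than the one in the complete code (it looks only at low surrogates and variation selectors).

```java
class SecondCharCheckDemo {

    // Hypothetical helper: true when the char after position i is a low
    // surrogate or a variation selector, i.e. the pair may form an emoji.
    static boolean secondCharSuggestsEmoji(CharSequence text, int i) {
        if (i + 1 >= text.length()) {
            return false;
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(text.charAt(i + 1));
        return block == Character.UnicodeBlock.LOW_SURROGATES
                || block == Character.UnicodeBlock.VARIATION_SELECTORS;
    }

    // Hypothetical helper: true when the char has general category
    // "Symbol, other", the category used for single-char symbols.
    static boolean isOtherSymbol(char c) {
        return Character.getType(c) == Character.OTHER_SYMBOL;
    }

    public static void main(String[] args) {
        String wave = "\uD83D\uDC4B";      // U+1F44B as a surrogate pair
        String leftArrow = "\u25C0\uFE0F"; // U+25C0 followed by U+FE0F
        System.out.println(secondCharSuggestsEmoji(wave, 0));      // true
        System.out.println(secondCharSuggestsEmoji(leftArrow, 0)); // true
        System.out.println(secondCharSuggestsEmoji("ab", 0));      // false
        System.out.println(isOtherSymbol('\u23EF'));               // true
        System.out.println(isOtherSymbol('a'));                    // false
    }
}
```

Note that a low surrogate in second position only says the code point is supplementary. Non-emoji supplementary characters pass the same test, so this check is a heuristic.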
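Once the second char shows that the current two chars form an emoji: 1) check whether any modifiers follow; 2) handle the cases in this order: flags; skin tone modifiers; emoji tag sequences; the zero-width joiner; consecutive variation selectors and combining marks; and otherwise treat the pair as an ordinary emoji.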
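Single-char special symbols are handled last. Some characters of this type are emoji and some are not; for now they are all simply treated as ordinary emoji.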
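The order of checks described above can be sketched as a classifier over single code points. This is a simplified illustration and not the article's implementation: the EmojiKindDemo class, the Kind enum, and the classify method are hypothetical names, and the sketch works on code points rather than on char pairs.

```java
class EmojiKindDemo {

    // Hypothetical categories, listed in the order they are checked.
    enum Kind {
        REGIONAL_INDICATOR, SKIN_TONE, TAG, TAG_CANCEL, ZWJ,
        VARIATION_OR_COMBINING, OTHER_SYMBOL, OTHER
    }

    static Kind classify(int codePoint) {
        if (codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF) {
            return Kind.REGIONAL_INDICATOR;   // half of a flag
        }
        if (codePoint >= 0x1F3FB && codePoint <= 0x1F3FF) {
            return Kind.SKIN_TONE;            // skin tone modifier
        }
        if (codePoint >= 0xE0020 && codePoint <= 0xE007E) {
            return Kind.TAG;                  // tag character
        }
        if (codePoint == 0xE007F) {
            return Kind.TAG_CANCEL;           // Cancel Tag ends a tag sequence
        }
        if (codePoint == 0x200D) {
            return Kind.ZWJ;                  // zero-width joiner
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        if (block == Character.UnicodeBlock.VARIATION_SELECTORS
                || block == Character.UnicodeBlock.COMBINING_MARKS_FOR_SYMBOLS) {
            return Kind.VARIATION_OR_COMBINING;
        }
        if (Character.getType(codePoint) == Character.OTHER_SYMBOL) {
            return Kind.OTHER_SYMBOL;         // may or may not be an emoji
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify(0x1F1E8)); // REGIONAL_INDICATOR
        System.out.println(classify(0x1F3FD)); // SKIN_TONE
        System.out.println(classify(0xE0067)); // TAG
        System.out.println(classify(0xE007F)); // TAG_CANCEL
        System.out.println(classify(0x200D));  // ZWJ
        System.out.println(classify(0xFE0F));  // VARIATION_OR_COMBINING
        System.out.println(classify(0x20E3));  // VARIATION_OR_COMBINING
        System.out.println(classify(0x23EF));  // OTHER_SYMBOL
        System.out.println(classify(0x00B0));  // OTHER_SYMBOL (degree sign)
        System.out.println(classify('a'));     // OTHER
    }
}
```

The degree sign shows why treating every "Symbol, other" character as an emoji is a simplification: it falls in the same category as ⏯ but is not an emoji.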
3. Complete Code
    package com.zpf.tool;

    import java.util.List;

    public class EmojiUtil {

        // Regional indicator symbols U+1F1E6..U+1F1FF; two of them form a flag.
        public static boolean isEmojiNationalFlag(int codePoint) {
            return codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF;
        }

        // Skin tone modifiers U+1F3FB..U+1F3FF.
        // Example: String str = new String(new int[]{0x1F44B, 0x1F3FD}, 0, 2);
        public static boolean isEmojiSkinColor(int codePoint) {
            return codePoint >= 0x1F3FB && codePoint <= 0x1F3FF;
        }

        // Cancel Tag U+E007F, which terminates a tag sequence.
        // Example: String str = new String(new int[]{0x1F3F4, 0xE0067, 0xE0062, 0xE0065, 0xE006E, 0xE0067, 0xE007F}, 0, 7);
        public static boolean isEmojiTagEnd(int codePoint) {
            return codePoint == 0xE007F;
        }

        // Tag characters U+E0020..U+E007E.
        public static boolean isEmojiTagSpec(int codePoint) {
            return codePoint >= 0xE0020 && codePoint <= 0xE007E;
        }

        // Blocks holding variation selectors and combining marks.
        public static boolean isEmojiDecorateBlock(Character.UnicodeBlock block) {
            if (block == null) {
                return false;
            }
            return block.equals(Character.UnicodeBlock.VARIATION_SELECTORS)
                    || block.equals(Character.UnicodeBlock.VARIATION_SELECTORS_SUPPLEMENT)
                    || block.equals(Character.UnicodeBlock.COMBINING_HALF_MARKS)
                    || block.equals(Character.UnicodeBlock.COMBINING_MARKS_FOR_SYMBOLS)
                    || block.equals(Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS)
                    || block.equals(Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS_SUPPLEMENT);
        }

        /**
         * Splits data into emoji and remaining text.
         * removeResult receives the text with emoji removed; emojiList receives
         * the emoji that were found. Either may be null, but not both.
         */
        public static void pickAllEmoji(CharSequence data, StringBuilder removeResult, List<String> emojiList) {
            if (removeResult == null && emojiList == null) {
                return;
            }
            if (removeResult != null) {
                removeResult.setLength(0);
            }
            if (emojiList != null) {
                emojiList.clear();
            }
            if (data == null || data.length() == 0) {
                return;
            }
            StringBuilder emojiBuilder = new StringBuilder();
            int i = 0;
            int j;
            Character.UnicodeBlock block;
            while (i < data.length()) {
                if (i + 1 < data.length()) {
                    // Decide by the type of the second char. Note that any
                    // surrogate pair or base + combining mark matches here.
                    block = Character.UnicodeBlock.of(data.charAt(i + 1));
                    if (isEmojiDecorateBlock(block) || Character.UnicodeBlock.LOW_SURROGATES.equals(block)) {
                        if (i + 2 >= data.length()) {
                            // The pair is the end of the input.
                            emojiBuilder.append(data, i, i + 2);
                            break;
                        }
                        j = handleNationalFlag(data, i, emojiBuilder, emojiList);
                        if (i != j) {
                            i = j;
                            continue;
                        }
                        j = handleHumanSkin(data, i, emojiBuilder, emojiList);
                        if (i != j) {
                            i = j;
                            continue;
                        }
                        j = handleTagSequence(data, i, emojiBuilder, emojiList);
                        if (i != j) {
                            i = j;
                            continue;
                        }
                        // Ordinary emoji: take the pair, then look at what follows.
                        emojiBuilder.append(data, i, i + 2);
                        i = handleNextChar(data, i + 2, emojiBuilder, emojiList);
                        continue;
                    }
                }
                recordEmoji(emojiBuilder, emojiList);
                int type = Character.getType(data.charAt(i));
                if (type == (int) Character.OTHER_SYMBOL) {
                    // Special symbols are always treated as emoji.
                    if (emojiList != null) {
                        emojiList.add(String.valueOf(data.charAt(i)));
                    }
                } else if (removeResult != null) {
                    removeResult.append(data.charAt(i));
                }
                i++;
            }
            recordEmoji(emojiBuilder, emojiList);
        }

        // Handles what follows an ordinary emoji: a zero-width joiner keeps the
        // emoji open; variation selectors and combining marks are appended.
        private static int handleNextChar(CharSequence data, int i, StringBuilder emojiBuilder, List<String> emojiList) {
            if (i >= data.length()) {
                return i;
            }
            char nextChar = data.charAt(i);
            if (nextChar == '\u200D') {
                // Zero-width joiner: the next emoji joins the current one.
                emojiBuilder.append(nextChar);
                return i + 1;
            }
            int j = i;
            Character.UnicodeBlock block;
            while (j < data.length()) {
                nextChar = data.charAt(j);
                block = Character.UnicodeBlock.of(nextChar);
                if (isEmojiDecorateBlock(block)) {
                    emojiBuilder.append(nextChar);
                    j++;
                } else {
                    break;
                }
            }
            // Fix: record even when no mark followed; the original recorded only
            // when i != j, which merged adjacent emoji into one list entry.
            recordEmoji(emojiBuilder, emojiList);
            return j;
        }

        // Handles flags: two consecutive regional indicator symbols.
        private static int handleNationalFlag(CharSequence data, int i, StringBuilder emojiBuilder, List<String> emojiList) {
            int codePoint = Character.codePointAt(data, i);
            if (!isEmojiNationalFlag(codePoint)) {
                return i;
            }
            recordEmoji(emojiBuilder, emojiList); // flush any pending emoji
            if (i + 3 < data.length() && isEmojiNationalFlag(Character.codePointAt(data, i + 2))) {
                emojiBuilder.append(data, i, i + 4);
                recordEmoji(emojiBuilder, emojiList);
                // Fix: the original advanced by 4 and then by 2 more here,
                // skipping the two chars after every flag.
                return i + 4;
            }
            // A lone regional indicator is skipped without being recorded.
            return i + 2;
        }

        // Handles an emoji followed by a skin tone modifier.
        private static int handleHumanSkin(CharSequence data, int i, StringBuilder emojiBuilder, List<String> emojiList) {
            if (i + 3 >= data.length()) {
                return i;
            }
            int codePoint = Character.codePointAt(data, i + 2);
            if (isEmojiSkinColor(codePoint)) {
                emojiBuilder.append(data, i, i + 4);
                recordEmoji(emojiBuilder, emojiList);
                i = i + 4;
            }
            return i;
        }

        // Handles a base emoji followed by tag characters and a Cancel Tag.
        private static int handleTagSequence(CharSequence data, int i, StringBuilder emojiBuilder, List<String> emojiList) {
            if (i + 3 >= data.length()) {
                return i;
            }
            int codePoint = Character.codePointAt(data, i + 2);
            if (isEmojiTagSpec(codePoint)) {
                emojiBuilder.append(data, i, i + 4);
                i = i + 4;
                while (i < data.length()) {
                    codePoint = Character.codePointAt(data, i);
                    if (isEmojiTagSpec(codePoint)) {
                        emojiBuilder.append(data, i, i + 2);
                        i = i + 2;
                    } else if (isEmojiTagEnd(codePoint)) {
                        emojiBuilder.append(data, i, i + 2);
                        recordEmoji(emojiBuilder, emojiList);
                        i = i + 2;
                        break;
                    } else {
                        // Malformed sequence: no Cancel Tag found.
                        break;
                    }
                }
                // Discard whatever remains of an unterminated sequence.
                emojiBuilder.setLength(0);
            } else if (isEmojiTagEnd(codePoint)) {
                emojiBuilder.append(data, i, i + 4);
                recordEmoji(emojiBuilder, emojiList);
                i = i + 4;
            }
            return i;
        }

        // Adds the pending emoji, if any, to the list and clears the builder.
        private static void recordEmoji(StringBuilder builder, List<String> emojiList) {
            if (builder != null && builder.length() > 0) {
                if (emojiList != null) {
                    emojiList.add(builder.toString());
                }
                builder.setLength(0);
            }
        }
    }